For this project I created my own dataset. Using Python and the Yahoo Finance API, I scraped the S&P 500, Russell 2000, and S&P 400 constituent lists to get a mix of small-, mid-, and large-cap stocks, then selected 1,000 companies at random from the aggregated list. I then pulled the financial data for those companies from the Yahoo Finance and Finnhub APIs and wrote it to a CSV. GitHub link to the process: .
The objective of this project is to develop a predictive model that estimates the Recent Close Price of stocks from various market and financial variables. I hope to answer the following questions. 1. What are the most significant predictors of stock prices? 2. How do market variables interact to influence stock prices? I want to identify which variables have interaction effects. 3. Can we predict stock prices accurately? I want to see whether we can achieve a good R^2 value, meaning historical data can be used to predict current prices.
By developing and validating this model, the goal is to generate insights into the economic drivers of stock prices, which can potentially inform investment strategies or financial decision-making.
- Ticker: identifier for the stock (not used as a predictor).
- Sector: categorical variable indicating the sector.
- Industry: categorical variable indicating the industry.
- Region: categorical variable indicating the region (not used as a predictor because it is "North America" for all rows).
- Market Cap Classification: categorical variable indicating the market cap classification.
- Volatility Classification: categorical variable indicating the volatility classification.
- Growth vs Value: categorical variable indicating growth or value classification.
- P/E Ratio: quantitative variable representing the price-to-earnings ratio.
- Dividend Yield (%): quantitative variable representing the dividend yield.
- Beta: quantitative variable representing the stock's volatility relative to the market.
- Avg Volume: quantitative variable representing the average trading volume.
- Recent.Close.Price: quantitative variable representing the most recent closing price (target).
- EPS: quantitative variable indicating the portion of a company's profit allocated to each share.
- Revenue: quantitative variable indicating the company's revenue.
- Net Income: quantitative variable indicating the net profit.
- Debt-to-Equity Ratio: quantitative variable comparing a company's total liabilities to its equity.
- ROE: quantitative variable measuring how much profit a company generates with shareholder money.
- P/B Ratio: quantitative variable comparing a company's market value to its book value.
- Free Cash Flow: quantitative variable representing the cash a company generates after capital expenditures.
- Analyst Ratings: categorical variable from the Finnhub API based on analyst ratings.
- Insider Transactions: categorical variable (Finnhub API) indicating insider transactions.

## R Libraries and Helper Functions
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(tidyr)
library(dplyr)
library(Boruta)
library(doParallel)
## Loading required package: foreach
##
## Attaching package: 'foreach'
##
## The following objects are masked from 'package:purrr':
##
## accumulate, when
##
## Loading required package: iterators
## Loading required package: parallel
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(olsrr)
##
## Attaching package: 'olsrr'
##
## The following object is masked from 'package:datasets':
##
## rivers
library(MASS)
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:olsrr':
##
## cement
##
## The following object is masked from 'package:dplyr':
##
## select
library(margins)
library(corrplot)
## corrplot 0.95 loaded
library(effects)
## Use the command
## lattice::trellis.par.set(effectsTheme())
## to customize lattice options for effects plots.
## See ?effectsTheme for details.
# I like using helper functions because they make my code more modular, which is how I write my CS programs
# this calculates the optimal number of bins for a histogram using Sturges' rule
helper.calculate_bins_sturges <- function(data) {
n <- nrow(data)
bins <- ceiling(log2(n) + 1)
return(bins)
}
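As a quick sanity check of Sturges' rule (the 426-row count assumes the cleaned, NA-free dataset used later in this report):

```r
# Sturges' rule: bins = ceiling(log2(n) + 1).
# For a 426-row dataset this gives 10 bins.
n <- 426
bins <- ceiling(log2(n) + 1)
print(bins)
```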
# this is a helper that can plot either a one-variable histogram or many variables from the same dataset through facet_wrap
helper.plot_histograms <- function(data, color = "blue", bins = NULL, variable_name = NULL) {
if (is.null(bins)) {
bins <- helper.calculate_bins_sturges(data)
}
if (!is.null(variable_name)) {
ggplot(data, aes(x = .data[[variable_name]])) +
geom_histogram(bins = bins, fill = color, alpha = 0.7) +
theme_minimal() +
labs(title = paste("Distribution of", variable_name))
} else {
# validate input for multi-variable plotting
if (!all(c("Variable", "Value") %in% names(data))) {
stop("Data must contain 'Variable' and 'Value' columns for multi-variable plotting.")
}
ggplot(data, aes(x = Value)) +
geom_histogram(bins = bins, fill = color, alpha = 0.7) +
facet_wrap(~ Variable, scales = "free") +
theme_minimal() +
labs(title = "Distribution of Predictors")
}
}
#box plot basic helper using ggplot
helper.create_boxplot <- function(data, x_var, y_var = NULL, fill_color = "blue", title_prefix = "Latest Closing") {
if (!is.null(y_var)) {
ggplot(data, aes_string(x = x_var, y = y_var)) +
geom_boxplot(fill = fill_color, alpha = 0.7) +
theme_minimal() +
labs(title = paste(title_prefix, y_var, "by", x_var), x = x_var, y = y_var)
} else {
return (Boxplot(as.formula(paste("~", x_var)), data = data, col = fill_color))
}
}
# helper function for summary for categorical/factor variables
helper.group_summary <- function(data, group_var, target_var) {
data %>%
group_by(across(all_of(group_var))) %>%
summarise(
mean = mean(.data[[target_var]], na.rm = TRUE),
sd = sd(.data[[target_var]], na.rm = TRUE),
n = n()
) %>%
arrange(desc(mean))
}
# helper function for anova for basic EDA of factor_variables
helper.run_anova <- function(data, factor_var, target_var) {
formula <- as.formula(paste(target_var, "~", factor_var)) #s/o CS131
anova_result <- aov(formula, data = data)
summary(anova_result)
}
helper.eda_factor <- function(data, factor_var, target_var, color){
#this gives the boxplot
print(helper.create_boxplot(data, factor_var, target_var, color))
#now ANOVA
print(helper.run_anova(data, factor_var, target_var))
#finally summary
helper.group_summary(data, factor_var, target_var)
}
helper.analyze_variable <- function(data, var_name, dependent_var = "Recent.Close.Price") {
# 1. Histogram
print(helper.plot_histograms(data = data, variable_name = var_name))
# 2. QQ Plot
p2 <- ggplot(data, aes(sample = .data[[var_name]])) +
stat_qq() +
stat_qq_line(color = "red") +
theme_minimal() +
labs(title = paste("Q-Q Plot of", var_name))
print(p2)
# 3. Box plot (car::Boxplot draws when called, so no print is needed)
helper.create_boxplot(data = data, x_var = var_name)
# density Plot
p5 <- ggplot(data, aes_string(x = var_name)) +
geom_density(fill = "lightblue", alpha = 0.7) +
stat_function(fun = dnorm,
args = list(mean = mean(data[[var_name]], na.rm = TRUE),
sd = sd(data[[var_name]], na.rm = TRUE)),
color = "red", linewidth = 1) +
theme_minimal() +
labs(title = paste("Density Plot of", var_name),
x = var_name,
y = "Density")
print(p5)
# Print summary statistics
cat("\nSummary Statistics for", var_name, ":\n")
print(summary(data[[var_name]]))
}
detect_nonlinearities_and_outliers <- function(data, dependent_var, predictor_var) {
# Helper: Perform power transformation and return lambda
get_power_transform <- function(data, predictor_var) {
formula <- as.formula(paste(predictor_var, "~", 1))
power_transform <- powerTransform(formula, data = data)
lambda <- power_transform$lambda
return(lambda)
}
# Helper: Detect influential outliers using Cook's distance
detect_outliers <- function(model, data) {
cooks_d <- cooks.distance(model)
threshold <- 4 * mean(cooks_d, na.rm = TRUE)
influential_points <- which(cooks_d > threshold)
outliers <- data[influential_points, ]
return(list(outliers = outliers, cooks_d = cooks_d))
}
# Step 1: Power Transformation
lambda <- get_power_transform(data, predictor_var)
transformed_var <- paste0(predictor_var, "_transformed")
data[[transformed_var]] <- bcPower(data[[predictor_var]], lambda)
# Step 2: Fit the three models
model1 <- lm(as.formula(paste(dependent_var, "~", predictor_var)), data = data)
model2 <- lm(as.formula(paste(dependent_var, "~", transformed_var)), data = data)
# Detect outliers for Model 1
outlier_info <- detect_outliers(model1, data)
data_without_outliers <- anti_join(data, outlier_info$outliers, by = names(data))
model3 <- lm(as.formula(paste(dependent_var, "~", predictor_var)), data = data_without_outliers)
# Step 3: Summarize models
model_summaries <- list(
base_model = summary(model1),
transformed_model = summary(model2),
model_without_outliers = summary(model3)
)
# Step 4: Visualization
par(mfrow = c(2, 2)) # Set up 2x2 plotting area
# Scatterplot: Original vs. Transformed Predictor
plot(data[[predictor_var]], data[[transformed_var]],
main = "Original vs Transformed Predictor",
xlab = predictor_var,
ylab = transformed_var)
# Residual plots for Model 1
plot(residuals(model1), main = "Residuals of Base Model", xlab = "Index", ylab = "Residuals")
# Residual plots for Model 3 (without outliers)
plot(residuals(model3), main = "Residuals of Model w/o Outliers", xlab = "Index", ylab = "Residuals")
# Cook's distance plot
plot(outlier_info$cooks_d, type = "h", main = "Cook's Distance",
xlab = "Index", ylab = "Cook's Distance")
abline(h = 4 * mean(outlier_info$cooks_d, na.rm = TRUE), col = "red", lty = 2)
# Plot the models themselves
par(mfrow = c(2, 2)) # Reset plotting area
plot(model1, main = "Base Model")
plot(model2, main = "Transformed Model")
plot(model3, main = "Model without Outliers")
# Scatterplots for each model
p1 <- ggplot(data, aes_string(x = predictor_var, y = dependent_var)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", color = "blue", se = TRUE) +
theme_minimal() +
labs(title = "Base Model",
x = predictor_var,
y = dependent_var)
p2 <- ggplot(data, aes_string(x = transformed_var, y = dependent_var)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", color = "blue", se = TRUE) +
theme_minimal() +
labs(title = "Transformed Model",
x = transformed_var,
y = dependent_var)
p3 <- ggplot(data_without_outliers, aes_string(x = predictor_var, y = dependent_var)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", color = "blue", se = TRUE) +
theme_minimal() +
labs(title = "Model without Outliers",
x = predictor_var,
y = dependent_var)
gridExtra::grid.arrange(p1, p2, p3, ncol = 1)
# Step 5: Return results
results <- list(
lambda = lambda,
model_summaries = model_summaries,
influential_outliers = outlier_info$outliers,
data_without_outliers = data_without_outliers
)
return(results)
}
helper.remove_specific_rows <- function(data, var1_pattern, var2_pattern) {
# Create logical index for rows to keep
keep_rows <- !(grepl(var1_pattern, data$var1) & grepl(var2_pattern, data$var2) |
grepl(var1_pattern, data$var2) & grepl(var2_pattern, data$var1))
# Subset the data to keep only the desired rows
filtered_data <- data[keep_rows, ]
return(filtered_data)
}
helper.diagnostic_plots <- function(model) {
par(mfrow = c(2, 2)) # Set up a 2x2 plotting layout
# 1. Cook's Distance Plot
plot(model, which = 4, main = "Cook's Distance Plot")
abline(h = 4/length(model$residuals), col = "red", lty = 2) # Add threshold line
legend("topright", legend = "Threshold: 4/n", col = "red", lty = 2, bty = "n")
# 2. Residuals vs Fitted Plot
plot(model$fitted.values, model$residuals,
main = "Residuals vs Fitted",
xlab = "Fitted Values", ylab = "Residuals", pch = 20, col = "blue")
abline(h = 0, col = "red", lty = 2)
# 3. QQ-Plot
qqnorm(model$residuals, main = "QQ-Plot")
qqline(model$residuals, col = "red", lty = 2)
}
stock_data <- read.csv("enhanced_stock_dataset.csv")
na_counts_base <- colSums(is.na(stock_data))
print(na_counts_base)
## Ticker Sector Industry
## 0 0 0
## Region Market.Cap Market.Cap.Classification
## 0 0 0
## Volatility.Classification Growth.vs.Value P.E.Ratio
## 0 0 140
## Dividend.Yield.... X52.Week.High X52.Week.Low
## 0 0 0
## Beta Avg.Volume Recent.Close.Price
## 11 0 0
## EPS Revenue Net.Income
## 0 2 0
## Debt.to.Equity.Ratio ROE P.B.Ratio
## 99 38 38
## Free.Cash.Flow Analyst.Ratings Insider.Transactions
## 82 0 0
stock_data <- na.omit(stock_data)
if (sum(is.na(stock_data)) > 0) {
print("missing values present in data")
#double checking no missing values in data
}
stock_data$X52.Week.High <- NULL
stock_data$X52.Week.Low <- NULL
stock_data$Market.Cap <- NULL
helper.plot_histograms( data = stock_data, color = "green", variable_name = "Recent.Close.Price")
# 1 Variable Selection

### Boruta Algorithm Selection
set.seed(123)
# Boruta runs single-threaded in R by default, so we set up parallel workers manually; I am not waiting for the Boruta analysis
cl <- makeCluster(detectCores() - 1) # Use all but one core
registerDoParallel(cl)
boruta_result_stock <- Boruta(Recent.Close.Price ~ ., data = stock_data, doTrace = 2)
## 1. run of importance source...
## 2. run of importance source...
## 3. run of importance source...
## 4. run of importance source...
## 5. run of importance source...
## 6. run of importance source...
## 7. run of importance source...
## 8. run of importance source...
## 9. run of importance source...
## 10. run of importance source...
## 11. run of importance source...
## After 11 iterations, +2.6 secs:
## confirmed 11 attributes: Avg.Volume, Dividend.Yield...., EPS, Free.Cash.Flow, Growth.vs.Value and 6 more;
## rejected 3 attributes: Analyst.Ratings, Insider.Transactions, Region;
## still have 6 attributes left.
## 12. run of importance source...
## 13. run of importance source...
## 14. run of importance source...
## 15. run of importance source...
## After 15 iterations, +3.4 secs:
## rejected 1 attribute: Ticker;
## still have 5 attributes left.
## 16. run of importance source...
## 17. run of importance source...
## 18. run of importance source...
## 19. run of importance source...
## After 19 iterations, +4.4 secs:
## rejected 2 attributes: Industry, Volatility.Classification;
## still have 3 attributes left.
## 20. run of importance source...
## 21. run of importance source...
## 22. run of importance source...
## 23. run of importance source...
## 24. run of importance source...
## 25. run of importance source...
## 26. run of importance source...
## 27. run of importance source...
## 28. run of importance source...
## After 28 iterations, +6.1 secs:
## rejected 1 attribute: Sector;
## still have 2 attributes left.
## 29. run of importance source...
## 30. run of importance source...
## 31. run of importance source...
## 32. run of importance source...
## 33. run of importance source...
## 34. run of importance source...
## 35. run of importance source...
## 36. run of importance source...
## 37. run of importance source...
## 38. run of importance source...
## 39. run of importance source...
## 40. run of importance source...
## 41. run of importance source...
## 42. run of importance source...
## 43. run of importance source...
## 44. run of importance source...
## 45. run of importance source...
## 46. run of importance source...
## 47. run of importance source...
## 48. run of importance source...
## 49. run of importance source...
## 50. run of importance source...
## 51. run of importance source...
## 52. run of importance source...
## 53. run of importance source...
## 54. run of importance source...
## 55. run of importance source...
## 56. run of importance source...
## 57. run of importance source...
## 58. run of importance source...
## 59. run of importance source...
## 60. run of importance source...
## 61. run of importance source...
## 62. run of importance source...
## 63. run of importance source...
## 64. run of importance source...
## 65. run of importance source...
## 66. run of importance source...
## 67. run of importance source...
## 68. run of importance source...
## 69. run of importance source...
## 70. run of importance source...
## 71. run of importance source...
## 72. run of importance source...
## 73. run of importance source...
## 74. run of importance source...
## 75. run of importance source...
## 76. run of importance source...
## After 76 iterations, +15 secs:
## confirmed 1 attribute: Beta;
## still have 1 attribute left.
## 77. run of importance source...
## 78. run of importance source...
## 79. run of importance source...
## 80. run of importance source...
## 81. run of importance source...
## 82. run of importance source...
## 83. run of importance source...
## After 83 iterations, +17 secs:
## confirmed 1 attribute: Debt.to.Equity.Ratio;
## no more attributes left.
#stop cluster
stopCluster(cl)
registerDoSEQ()
plot(boruta_result_stock, xlab = "", xaxt = "n", main = "Variable Importance via Boruta")
lz_stock <- lapply(1:ncol(boruta_result_stock$ImpHistory), function(i)
boruta_result_stock$ImpHistory[is.finite(boruta_result_stock$ImpHistory[, i]), i])
names(lz_stock) <- colnames(boruta_result_stock$ImpHistory)
Labels_stock <- sort(sapply(lz_stock, median))
axis(side = 1, las = 2, labels = names(Labels_stock),
at = 1:ncol(boruta_result_stock$ImpHistory), cex.axis = 0.7)
important_vars_stock <- getSelectedAttributes(boruta_result_stock, withTentative = TRUE)
importance_scores_stock <- attStats(boruta_result_stock)
sorted_scores_stock <- importance_scores_stock[order(-importance_scores_stock$meanImp), ]
print(sorted_scores_stock)
## meanImp medianImp minImp maxImp
## EPS 25.01231477 25.15203141 21.4834585 27.842510
## P.E.Ratio 11.02554761 11.21769856 6.8088376 13.350420
## P.B.Ratio 10.89676237 10.86590761 7.9868108 14.105185
## Avg.Volume 8.73858498 8.77847636 7.0982706 10.599089
## Market.Cap.Classification 8.30883087 8.32893204 5.5445765 11.097122
## Net.Income 7.49448151 7.49718636 5.4265206 9.397697
## Revenue 6.36979998 6.35140267 3.6918324 7.941068
## Dividend.Yield.... 5.93230512 5.97045699 3.0633932 8.262782
## Growth.vs.Value 5.90310185 5.92254450 3.3886229 8.133442
## ROE 5.45336923 5.43734902 3.5465118 7.829754
## Free.Cash.Flow 5.11396132 5.06443586 3.5092127 6.546770
## Debt.to.Equity.Ratio 2.59059851 2.60739645 0.1912753 5.057399
## Beta 2.52598841 2.58644623 0.5517994 4.773966
## Sector 0.92621503 0.92198067 -0.8894823 2.854915
## Insider.Transactions 0.75721628 0.99035980 -1.3162031 1.732280
## Volatility.Classification 0.61263422 0.45712574 -0.9238392 3.085234
## Industry 0.57789013 0.33665635 -1.7478825 3.320454
## Ticker 0.10154647 -0.01356575 -1.7902175 1.953404
## Analyst.Ratings 0.09234995 0.14353443 -1.1116901 1.625413
## Region 0.00000000 0.00000000 0.0000000 0.000000
## normHits decision
## EPS 1.00000000 Confirmed
## P.E.Ratio 1.00000000 Confirmed
## P.B.Ratio 1.00000000 Confirmed
## Avg.Volume 1.00000000 Confirmed
## Market.Cap.Classification 1.00000000 Confirmed
## Net.Income 1.00000000 Confirmed
## Revenue 1.00000000 Confirmed
## Dividend.Yield.... 1.00000000 Confirmed
## Growth.vs.Value 1.00000000 Confirmed
## ROE 1.00000000 Confirmed
## Free.Cash.Flow 1.00000000 Confirmed
## Debt.to.Equity.Ratio 0.68674699 Confirmed
## Beta 0.72289157 Confirmed
## Sector 0.06024096 Rejected
## Insider.Transactions 0.00000000 Rejected
## Volatility.Classification 0.02409639 Rejected
## Industry 0.02409639 Rejected
## Ticker 0.01204819 Rejected
## Analyst.Ratings 0.00000000 Rejected
## Region 0.00000000 Rejected
quantitative_vars <- names(stock_data)[sapply(stock_data, is.numeric)]
top_quantitative_scores <- sorted_scores_stock[rownames(sorted_scores_stock) %in% quantitative_vars, ]
top_5_vars_stock <- rownames(top_quantitative_scores)[1:5]
print(top_5_vars_stock)
## [1] "EPS" "P.E.Ratio" "P.B.Ratio" "Avg.Volume" "Net.Income"
I ran Boruta on the variables and identified the 5 quantitative variables best suited for my MLR model: the five quantitative variables with the highest meanImp from the Boruta results. The five variables I will be using are "EPS", "P.E.Ratio", "P.B.Ratio", "Avg.Volume", and "Net.Income".
# R is pretty cool because it automatically creates dummy variables for factors in model formulas, so that saves us a little code
#this is doing all of the EDA for our factor variables
helper.eda_factor(stock_data, "Sector", "Recent.Close.Price", "blue")
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Df Sum Sq Mean Sq F value Pr(>F)
## Sector 10 385350 38535 1.947 0.0377 *
## Residuals 415 8215020 19795
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## # A tibble: 11 × 4
## Sector mean sd n
## <chr> <dbl> <dbl> <int>
## 1 Financial Services 183. 148. 33
## 2 Industrials 146. 157. 85
## 3 Technology 136. 130. 64
## 4 Consumer Defensive 131. 220. 33
## 5 Healthcare 130. 102. 47
## 6 Consumer Cyclical 122. 121. 52
## 7 Basic Materials 122. 133. 25
## 8 Real Estate 101. 187. 28
## 9 Communication Services 81.7 109. 12
## 10 Utilities 76.3 40.6 25
## 11 Energy 47.6 48.4 22
helper.eda_factor(stock_data, "Industry", "Recent.Close.Price", "green")
## Df Sum Sq Mean Sq F value Pr(>F)
## Industry 114 2493519 21873 1.114 0.234
## Residuals 311 6106851 19636
## # A tibble: 115 × 4
## Industry mean sd n
## <chr> <dbl> <dbl> <int>
## 1 Home Improvement Retail 429. NA 1
## 2 Industrial Distribution 394. 544. 4
## 3 Healthcare Plans 338. NA 1
## 4 Financial Data & Stock Exchanges 327. 145. 6
## 5 Discount Stores 324. 432. 4
## 6 Insurance Brokers 312. NA 1
## 7 Diagnostics & Research 275. 130. 4
## 8 REIT - Specialty 273. 401. 5
## 9 Auto & Truck Dealerships 260. NA 1
## 10 Capital Markets 259. NA 1
## # ℹ 105 more rows
helper.eda_factor(stock_data, "Market.Cap.Classification", "Recent.Close.Price", "purple")
## Df Sum Sq Mean Sq F value Pr(>F)
## Market.Cap.Classification 2 1435734 717867 42.38 <2e-16 ***
## Residuals 423 7164636 16938
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## # A tibble: 3 × 4
## Market.Cap.Classification mean sd n
## <chr> <dbl> <dbl> <int>
## 1 Large Cap 186. 171. 187
## 2 Mid Cap 100. 100. 159
## 3 Small Cap 35.9 38.1 80
helper.eda_factor(stock_data, "Volatility.Classification", "Recent.Close.Price", "lightblue")
## Df Sum Sq Mean Sq F value Pr(>F)
## Volatility.Classification 2 153470 76735 3.843 0.0222 *
## Residuals 423 8446900 19969
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## # A tibble: 3 × 4
## Volatility.Classification mean sd n
## <chr> <dbl> <dbl> <int>
## 1 Low Volatility 146. 167. 104
## 2 Medium Volatility 138. 159. 148
## 3 High Volatility 104. 103. 174
helper.eda_factor(stock_data, "Growth.vs.Value", "Recent.Close.Price", "darkblue")
## Df Sum Sq Mean Sq F value Pr(>F)
## Growth.vs.Value 1 121359 121359 6.069 0.0142 *
## Residuals 424 8479010 19998
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## # A tibble: 2 × 4
## Growth.vs.Value mean sd n
## <chr> <dbl> <dbl> <int>
## 1 Growth 134. 152. 341
## 2 Value 92.2 88.4 85
helper.eda_factor(stock_data, "Analyst.Ratings", "Recent.Close.Price", "pink")
## Df Sum Sq Mean Sq F value Pr(>F)
## Analyst.Ratings 3 43510 14503 0.715 0.543
## Residuals 422 8556860 20277
## # A tibble: 4 × 4
## Analyst.Ratings mean sd n
## <chr> <dbl> <dbl> <int>
## 1 sell 162. 214. 25
## 2 buy 125. 132. 324
## 3 strongBuy 123. 169. 64
## 4 strongSell 98.3 64.9 13
helper.eda_factor(stock_data, "Insider.Transactions", "Recent.Close.Price", "yellow")
## Df Sum Sq Mean Sq F value Pr(>F)
## Insider.Transactions 2 24936 12468 0.615 0.541
## Residuals 423 8575434 20273
## # A tibble: 3 × 4
## Insider.Transactions mean sd n
## <chr> <dbl> <dbl> <int>
## 1 Negative 132. 135. 267
## 2 Positive 117. 156. 148
## 3 Neutral 108. 100. 11
After conducting EDA on all of our factor variables, there are only two variables that I feel comfortable selecting: Market.Cap.Classification and Growth.vs.Value. I eliminated Industry first, as it has too many categories and would overfit our model despite its low p-value from the ANOVA test. I then eliminated Analyst.Ratings and Insider.Transactions because their ANOVA p-values were not significant. That left four variables, but Sector and Volatility.Classification were both rejected by Boruta and had higher p-values than the chosen variables, which is why I rejected them as well. I was at first surprised that Sector failed Boruta, but then I realized that stock price is not an indication of growth rates in a sector; since we have no growth variables, the strong performance of technology stocks this year compared to energy stocks was not leveraged the way I thought it would be.
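Since Market.Cap.Classification was by far the strongest factor in the ANOVA, a Tukey HSD follow-up shows which pairs of levels actually differ. Here is a sketch on simulated data whose group means and sizes loosely mimic the Large/Mid/Small Cap summary table above (not the real dataset):

```r
set.seed(1)
# toy groups roughly matching the market-cap group summary
toy <- data.frame(
  cap   = factor(rep(c("Large", "Mid", "Small"), times = c(187, 159, 80))),
  price = c(rnorm(187, mean = 186, sd = 100),
            rnorm(159, mean = 100, sd = 80),
            rnorm(80,  mean = 36,  sd = 30))
)
fit <- aov(price ~ cap, data = toy)
TukeyHSD(fit)  # pairwise mean differences with adjusted p-values
```

With separation this strong, all three pairwise comparisons come out significant, matching the very small ANOVA p-value seen for Market.Cap.Classification.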
## Analysis for EPS
helper.analyze_variable(stock_data, "EPS")
##
## Summary Statistics for EPS :
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.030 1.383 3.320 5.160 6.603 51.210
results <- detect_nonlinearities_and_outliers(
data = stock_data,
dependent_var = "Recent.Close.Price",
predictor_var = "EPS"
)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
# Output results
print("Optimal Lambda from powerTransform")
## [1] "Optimal Lambda from powerTransform"
results$lambda # Optimal lambda
## Y1
## 0.170388
results$model_summaries # Summaries of the three models
## $base_model
##
## Call:
## lm(formula = as.formula(paste(dependent_var, "~", predictor_var)),
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -425.60 -38.96 -20.55 23.90 756.17
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.6797 6.0882 6.517 2.03e-10 ***
## EPS 16.7384 0.7527 22.237 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 96.77 on 424 degrees of freedom
## Multiple R-squared: 0.5384, Adjusted R-squared: 0.5373
## F-statistic: 494.5 on 1 and 424 DF, p-value: < 2.2e-16
##
##
## $transformed_model
##
## Call:
## lm(formula = as.formula(paste(dependent_var, "~", transformed_var)),
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -211.58 -63.26 -18.84 30.21 832.17
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 40.321 7.006 5.755 1.66e-08 ***
## EPS_transformed 66.792 3.679 18.156 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 106.8 on 424 degrees of freedom
## Multiple R-squared: 0.4374, Adjusted R-squared: 0.4361
## F-statistic: 329.6 on 1 and 424 DF, p-value: < 2.2e-16
##
##
## $model_without_outliers
##
## Call:
## lm(formula = as.formula(paste(dependent_var, "~", predictor_var)),
## data = data_without_outliers)
##
## Residuals:
## Min 1Q Median 3Q Max
## -208.17 -35.08 -16.07 23.68 373.28
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 32.0134 4.7356 6.76 4.7e-11 ***
## EPS 18.3237 0.6903 26.54 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 71.24 on 414 degrees of freedom
## Multiple R-squared: 0.6299, Adjusted R-squared: 0.629
## F-statistic: 704.6 on 1 and 414 DF, p-value: < 2.2e-16
results$influential_outliers # Name of the outliers
## Ticker Sector Industry Region
## 95 CB Financial Services Insurance - Property & Casualty North America
## 98 GWW Industrials Industrial Distribution North America
## 111 THC Healthcare Medical Care Facilities North America
## 130 CHTR Communication Services Telecom Services North America
## 151 EQIX Real Estate REIT - Specialty North America
## 199 NEU Basic Materials Specialty Chemicals North America
## 268 COST Consumer Defensive Discount Stores North America
## 451 CINF Financial Services Insurance - Property & Casualty North America
## 486 HOV Consumer Cyclical Residential Construction North America
## 635 MTH Consumer Cyclical Residential Construction North America
## Market.Cap.Classification Volatility.Classification Growth.vs.Value
## 95 Large Cap Low Volatility Value
## 98 Large Cap Medium Volatility Growth
## 111 Large Cap High Volatility Value
## 130 Large Cap Medium Volatility Value
## 151 Large Cap Low Volatility Growth
## 199 Mid Cap Low Volatility Value
## 268 Large Cap Low Volatility Growth
## 451 Large Cap Low Volatility Value
## 486 Small Cap High Volatility Value
## 635 Mid Cap High Volatility Value
## P.E.Ratio Dividend.Yield.... Beta Avg.Volume Recent.Close.Price EPS
## 95 11.833198 1.26 0.681 1636872.91 288.73 24.40
## 98 32.682755 0.68 1.151 233209.16 1205.34 36.88
## 111 4.518049 0.00 2.150 1210783.27 142.68 31.58
## 130 12.436247 0.00 1.036 1351296.02 396.97 31.92
## 151 88.501350 1.74 0.701 525569.72 981.48 11.09
## 199 11.848990 1.87 0.512 36585.66 533.56 45.03
## 268 58.830505 0.48 0.789 1990399.60 971.88 16.52
## 451 8.209040 2.03 0.666 689332.27 159.83 19.47
## 486 6.130652 0.00 2.607 82824.70 196.61 32.07
## 635 8.641791 1.57 1.822 411294.02 191.07 22.11
## Revenue Net.Income Debt.to.Equity.Ratio ROE P.B.Ratio
## 95 54697000960 9996999680 31.706 0.16125 1.769667
## 98 16931999744 1828999936 83.325 0.52611 16.759687
## 111 20971999232 3126000128 159.393 0.59565 3.538427
## 130 54869999616 4674999808 533.472 0.32965 4.003601
## 151 8401490944 1056177984 140.964 0.08267 6.969451
## 199 2775260928 430552992 84.902 0.36990 3.752523
## 268 254453006336 7367000064 42.118 0.30267 18.231411
## 451 12154999808 3070000128 6.331 0.25135 1.809731
## 486 2912312064 221342000 167.111 0.41306 2.093845
## 635 6433726976 812387968 27.268 0.17192 1.374377
## Free.Cash.Flow Analyst.Ratings Insider.Transactions EPS_transformed
## 95 13009625088 buy Negative 4.245802
## 98 1546249984 strongBuy Positive 4.983382
## 111 3382625024 buy Negative 4.700260
## 130 3804124928 buy Negative 4.719562
## 151 2946253568 buy Negative 2.974147
## 199 368575488 sell Negative 5.358926
## 268 4467500032 buy Negative 3.595490
## 451 3040124928 buy Positive 3.864194
## 486 -103003128 sell Positive 4.728024
## 635 -350566624 buy Negative 4.077370
In my opinion, for EPS there is no reason to do a powerTransform: the R^2 value actually decreases after the transform, and although it definitely makes the variable more normally distributed, I do not think that gain is meaningful for the purposes of a multiple linear regression model. The next question concerns outliers. Using Cook's Distance, we identified 10 outliers whose Cook's Distance exceeds four times the mean Cook's Distance. I do not think it is worth removing these outliers, however, as the R^2 does not change drastically with their exclusion.
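The R^2 comparison driving this decision can be reproduced in miniature: when the true relationship is already linear in the predictor, power-transforming a right-skewed predictor lowers the fit. A sketch on simulated data (the 0.17 exponent mirrors the lambda found above; the coefficients are invented, not from the dataset):

```r
set.seed(42)
eps_like <- rexp(300, rate = 0.2)                # right-skewed predictor, like EPS
price    <- 40 + 17 * eps_like + rnorm(300, sd = 90)

r2 <- function(m) summary(m)$r.squared
base_fit  <- lm(price ~ eps_like)
trans_fit <- lm(price ~ I(eps_like^0.17))        # x^lambda is affine to bcPower(x, lambda)
c(base = r2(base_fit), transformed = r2(trans_fit))
```

Because the data-generating process is linear in the raw predictor, the transformed model's R^2 comes out lower, just as it did for EPS.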
helper.analyze_variable(stock_data, "P.B.Ratio")
##
## Summary Statistics for P.B.Ratio :
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.04988 1.78783 2.97480 5.59541 5.47365 138.84380
results <- detect_nonlinearities_and_outliers(
data = stock_data,
dependent_var = "Recent.Close.Price",
predictor_var = "P.B.Ratio"
)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
# Output results
print("Optimal Lambda from powerTransform")
## [1] "Optimal Lambda from powerTransform"
results$lambda # Optimal lambda
## Y1
## -0.1097533
results$model_summaries # Summaries of the three models
## $base_model
##
## Call:
## lm(formula = as.formula(paste(dependent_var, "~", predictor_var)),
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -333.31 -78.39 -36.82 26.80 1037.27
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 104.9936 7.5509 13.90 < 2e-16 ***
## P.B.Ratio 3.7634 0.6433 5.85 9.83e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 137 on 424 degrees of freedom
## Multiple R-squared: 0.07469, Adjusted R-squared: 0.07251
## F-statistic: 34.22 on 1 and 424 DF, p-value: 9.833e-09
##
##
## $transformed_model
##
## Call:
## lm(formula = as.formula(paste(dependent_var, "~", transformed_var)),
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -225.79 -65.24 -26.82 25.92 978.93
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 45.450 10.584 4.294 2.17e-05 ***
## P.B.Ratio_transformed 74.633 7.895 9.454 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 129.4 on 424 degrees of freedom
## Multiple R-squared: 0.1741, Adjusted R-squared: 0.1721
## F-statistic: 89.37 on 1 and 424 DF, p-value: < 2.2e-16
##
##
## $model_without_outliers
##
## Call:
## lm(formula = as.formula(paste(dependent_var, "~", predictor_var)),
## data = data_without_outliers)
##
## Residuals:
## Min 1Q Median 3Q Max
## -221.12 -66.45 -28.02 29.01 428.79
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 87.1841 6.2032 14.05 < 2e-16 ***
## P.B.Ratio 5.9394 0.7071 8.40 7.04e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 103.9 on 418 degrees of freedom
## Multiple R-squared: 0.1444, Adjusted R-squared: 0.1424
## F-statistic: 70.56 on 1 and 418 DF, p-value: 7.036e-16
results$influential_outliers # Name of the outliers
## Ticker Sector Industry Region
## 98 GWW Industrials Industrial Distribution North America
## 151 EQIX Real Estate REIT - Specialty North America
## 268 COST Consumer Defensive Discount Stores North America
## 494 VRSK Industrials Consulting Services North America
## 622 GHC Consumer Defensive Education & Training Services North America
## 643 FTNT Technology Software - Infrastructure North America
## Market.Cap.Classification Volatility.Classification Growth.vs.Value
## 98 Large Cap Medium Volatility Growth
## 151 Large Cap Low Volatility Growth
## 268 Large Cap Low Volatility Growth
## 494 Large Cap Medium Volatility Growth
## 622 Mid Cap Medium Volatility Growth
## 643 Large Cap Medium Volatility Growth
## P.E.Ratio Dividend.Yield.... Beta Avg.Volume Recent.Close.Price EPS
## 98 32.68276 0.68 1.151 233209.16 1205.34 36.88
## 151 88.50135 1.74 0.701 525569.72 981.48 11.09
## 268 58.83051 0.48 0.789 1990399.60 971.88 16.52
## 494 45.47295 0.53 0.862 794768.92 294.21 6.47
## 622 18.18239 0.74 1.114 15381.27 931.12 51.21
## 643 47.76382 0.00 0.995 5533398.01 95.05 1.99
## Revenue Net.Income Debt.to.Equity.Ratio ROE P.B.Ratio
## 98 16931999744 1828999936 83.325 0.52611 16.759687
## 151 8401490944 1056177984 140.964 0.08267 6.969451
## 268 254453006336 7367000064 42.118 0.30267 18.231411
## 494 2823300096 930300032 1069.413 2.65609 138.843800
## 622 4711917056 227524992 32.787 0.05939 1.007689
## 643 5710799872 1529900032 118.588 3.11525 80.143340
## Free.Cash.Flow Analyst.Ratings Insider.Transactions P.B.Ratio_transformed
## 98 1546249984 strongBuy Positive 2.42459169
## 151 2946253568 buy Negative 1.74861823
## 268 4467500032 buy Negative 2.48607870
## 494 763787520 buy Negative 3.80942452
## 622 213084880 sell Positive 0.00765657
## 643 1648400000 buy Negative 3.47980961
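The P.B.Ratio_transformed column above is the Box-Cox transform that powerTransform estimates; with $\lambda \approx -0.1098$:

```latex
x^{(\lambda)} =
\begin{cases}
\dfrac{x^{\lambda}-1}{\lambda}, & \lambda \neq 0\\[6pt]
\log x, & \lambda = 0
\end{cases}
```

For example, row 98 with $x = 16.7597$ gives $(16.7597^{-0.1098}-1)/(-0.1098) \approx 2.42$, matching the table. Note the transform can produce values near or below zero for $x < 1$ even though a raw P/B ratio cannot be negative.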
In my opinion there is a reason to apply a powerTransform to the P/B ratio: the R^2 value increases by a significant amount after the transformation, and it definitely makes the variable more normally distributed. I would argue this transformation helps make our model more predictive. The one question I had is that it produces negative transformed values for some observations, which would be economically infeasible for a raw P/B ratio; but because this is a transformed variable, I do not think this is an issue, since every observation is rescaled consistently by the same transformation. The next question concerns outliers. Using Cook's Distance we identified 6 outliers whose distance is more than 4 standard deviations above the mean Cook's Distance. I do not think it is worth removing them, however, as the R^2 does not change drastically with their exclusion and is actually lower than that of the transformed linear model.

## Analysis for PE Ratio
helper.analyze_variable(stock_data, "P.E.Ratio")
##
## Summary Statistics for P.E.Ratio :
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.129 16.942 25.877 42.634 38.645 1029.182
results <- detect_nonlinearities_and_outliers(
data = stock_data,
dependent_var = "Recent.Close.Price",
predictor_var = "P.E.Ratio"
)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
# Output results
print("Optimal Lambda from powerTransform")
## [1] "Optimal Lambda from powerTransform"
results$lambda # Optimal lambda
## Y1
## -0.15614
results$model_summaries # Summaries of the three models
## $base_model
##
## Call:
## lm(formula = as.formula(paste(dependent_var, "~", predictor_var)),
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -126.29 -85.98 -42.26 29.92 1079.06
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 127.02916 7.84473 16.193 <2e-16 ***
## P.E.Ratio -0.02293 0.08755 -0.262 0.794
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 142.4 on 424 degrees of freedom
## Multiple R-squared: 0.0001618, Adjusted R-squared: -0.002196
## F-statistic: 0.0686 on 1 and 424 DF, p-value: 0.7935
##
##
## $transformed_model
##
## Call:
## lm(formula = as.formula(paste(dependent_var, "~", transformed_var)),
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -164.39 -80.06 -40.77 31.01 1074.16
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.36 37.71 0.778 0.43674
## P.E.Ratio_transformed 37.87 14.52 2.607 0.00945 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 141.3 on 424 degrees of freedom
## Multiple R-squared: 0.01578, Adjusted R-squared: 0.01346
## F-statistic: 6.797 on 1 and 424 DF, p-value: 0.009452
##
##
## $model_without_outliers
##
## Call:
## lm(formula = as.formula(paste(dependent_var, "~", predictor_var)),
## data = data_without_outliers)
##
## Residuals:
## Min 1Q Median 3Q Max
## -103.92 -64.72 -25.04 35.06 324.31
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 104.61794 5.09740 20.524 <2e-16 ***
## P.E.Ratio 0.00686 0.06900 0.099 0.921
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 87.69 on 403 degrees of freedom
## Multiple R-squared: 2.452e-05, Adjusted R-squared: -0.002457
## F-statistic: 0.009883 on 1 and 403 DF, p-value: 0.9209
results$influential_outliers # Name of the outliers
## Ticker Sector Industry Region
## 15 MCO Financial Services Financial Data & Stock Exchanges North America
## 32 MUSA Consumer Cyclical Specialty Retail North America
## 98 GWW Industrials Industrial Distribution North America
## 109 PEN Healthcare Medical Devices North America
## 151 EQIX Real Estate REIT - Specialty North America
## 153 VICR Technology Electronic Components North America
## 188 FDS Financial Services Financial Data & Stock Exchanges North America
## 199 NEU Basic Materials Specialty Chemicals North America
## 238 INCY Healthcare Biotechnology North America
## 268 COST Consumer Defensive Discount Stores North America
## 296 ZI Technology Software - Application North America
## 300 AMP Financial Services Asset Management North America
## 393 MPWR Technology Semiconductors North America
## 395 DE Industrials Farm & Heavy Construction Machinery North America
## 431 CVCO Consumer Cyclical Residential Construction North America
## 515 LIN Basic Materials Specialty Chemicals North America
## 528 LMT Industrials Aerospace & Defense North America
## 544 ADBE Technology Software - Infrastructure North America
## 576 NOC Industrials Aerospace & Defense North America
## 622 GHC Consumer Defensive Education & Training Services North America
## 644 SNPS Technology Software - Infrastructure North America
## Market.Cap.Classification Volatility.Classification Growth.vs.Value
## 15 Large Cap High Volatility Growth
## 32 Large Cap Low Volatility Growth
## 98 Large Cap Medium Volatility Growth
## 109 Mid Cap Low Volatility Growth
## 151 Large Cap Low Volatility Growth
## 153 Mid Cap High Volatility Growth
## 188 Large Cap Low Volatility Growth
## 199 Mid Cap Low Volatility Value
## 238 Large Cap Low Volatility Growth
## 268 Large Cap Low Volatility Growth
## 296 Mid Cap Medium Volatility Growth
## 300 Large Cap High Volatility Growth
## 393 Large Cap Medium Volatility Growth
## 395 Large Cap Medium Volatility Growth
## 431 Mid Cap High Volatility Growth
## 515 Large Cap Medium Volatility Growth
## 528 Large Cap Low Volatility Growth
## 544 Large Cap High Volatility Growth
## 576 Large Cap Low Volatility Growth
## 622 Mid Cap Medium Volatility Growth
## 644 Large Cap Medium Volatility Growth
## P.E.Ratio Dividend.Yield.... Beta Avg.Volume Recent.Close.Price EPS
## 15 45.66027 0.68 1.289 739825.90 499.98 10.95
## 32 22.62701 0.35 0.756 195428.69 547.80 24.21
## 98 32.68276 0.68 1.151 233209.16 1205.34 36.88
## 109 277.40910 0.00 0.514 374408.76 244.12 0.88
## 151 88.50135 1.74 0.701 525569.72 981.48 11.09
## 153 532.10000 0.00 1.487 231219.92 53.21 0.10
## 188 35.32541 0.85 0.752 265147.01 490.67 13.89
## 199 11.84899 1.87 0.512 36585.66 533.56 45.03
## 238 828.77770 0.00 0.710 2351324.30 74.59 0.09
## 268 58.83051 0.48 0.789 1990399.60 971.88 16.52
## 296 364.66666 0.00 1.037 7082583.67 10.94 0.03
## 300 21.93236 1.03 1.334 473387.65 573.97 26.17
## 393 64.06773 0.88 1.148 601479.68 567.64 8.86
## 395 18.18501 1.45 0.935 1489247.81 465.90 25.62
## 431 29.06780 0.00 1.247 65681.67 514.50 17.70
## 515 34.97648 1.21 0.949 1864140.24 460.99 13.18
## 528 19.15376 2.49 0.481 1077877.29 529.41 27.64
## 544 43.68586 0.00 1.299 3166801.59 515.93 11.81
## 576 30.20666 1.68 0.346 844117.93 489.65 16.21
## 622 18.18239 0.74 1.114 15381.27 931.12 51.21
## 644 57.45782 0.00 1.076 1052858.96 558.49 9.72
## Revenue Net.Income Debt.to.Equity.Ratio ROE P.B.Ratio
## 15 6896000000 2003000064 197.861 0.54016 23.204159
## 32 18275201024 510000000 282.568 0.60925 13.362931
## 98 16931999744 1828999936 83.325 0.52611 16.759687
## 109 1163776000 34547000 20.528 0.03129 8.480805
## 151 8401490944 1056177984 140.964 0.08267 6.969451
## 153 355544000 4551000 1.324 0.00842 4.329889
## 188 2203056128 537126016 82.338 0.30411 9.737255
## 199 2775260928 430552992 84.902 0.36990 3.752523
## 238 4075859968 32482000 1.151 0.00802 4.534898
## 268 254453006336 7367000064 42.118 0.30267 18.231411
## 296 1221600000 9000000 82.474 0.00459 2.248715
## 300 17444999168 2707000064 67.402 0.56751 9.764050
## 393 2039447040 434241984 0.693 0.20214 11.774077
## 395 55955001344 8224000000 288.043 0.35481 5.527280
## 431 1851947008 148164992 3.871 0.14376 4.028690
## 515 33024999424 6383000064 54.828 0.16190 5.603447
## 528 71295000576 6674999808 268.347 0.81037 17.353153
## 544 20946999296 5360000000 41.788 0.35355 15.784916
## 576 40985001984 2375000064 122.722 0.15484 4.841165
## 622 4711917056 227524992 32.787 0.05939 1.007689
## 644 6483438080 1511323008 8.700 0.21819 11.128180
## Free.Cash.Flow Analyst.Ratings Insider.Transactions P.E.Ratio_transformed
## 15 1952999936 buy Negative 2.877834
## 32 339112512 buy Positive 2.469241
## 98 1546249984 strongBuy Positive 2.688814
## 109 125270624 buy Negative 3.743680
## 151 2946253568 buy Negative 3.224055
## 153 26165500 buy Positive 4.000981
## 188 555261760 sell Negative 2.733652
## 199 368575488 sell Negative 2.050982
## 238 482474624 strongBuy Negative 4.161656
## 268 4467500032 buy Negative 3.014662
## 296 431324992 buy Negative 3.854913
## 300 4073125120 strongBuy Negative 2.450035
## 393 516699488 buy Negative 3.059501
## 395 3000000000 buy Positive 2.332637
## 431 87199872 buy Negative 2.620182
## 515 4216750080 buy Positive 2.727958
## 528 5304750080 buy Negative 2.365501
## 544 6618500096 buy Negative 2.853408
## 576 2600624896 buy Positive 2.642823
## 622 213084880 sell Positive 2.332545
## 644 -144225504 buy Positive 3.002143
In my opinion there is a reason to apply a powerTransform to the P/E ratio. The base model (with or without outliers) does not find P/E Ratio statistically significant, but the transformed model does: its p-value is below 0.05. This is why, from now on, I will mostly proceed with the transformed version of P/E. The next question concerns outliers. Using Cook's Distance we identified 21 outliers whose distance is more than 4 standard deviations above the mean Cook's Distance. I do not think it is worth removing them, however, as the R^2 does not change drastically with their exclusion and is actually lower than that of the transformed linear model. Additionally, removing them would mean discarding around 5% of our data, which is a lot.
helper.analyze_variable(stock_data, "Avg.Volume")
##
## Summary Statistics for Avg.Volume :
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12418 314911 765520 2265197 1734321 96770663
data = stock_data
dependent_var = "Recent.Close.Price"
predictor_var = "Avg.Volume"
results <- detect_nonlinearities_and_outliers(
data = stock_data,
dependent_var = "Recent.Close.Price",
predictor_var = "Avg.Volume"
)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
# Output results
print("Optimal Lambda from powerTransform")
## [1] "Optimal Lambda from powerTransform"
results$lambda # Optimal lambda
## Y1
## -0.01638639
results$model_summaries # Summaries of the three models
## $base_model
##
## Call:
## lm(formula = as.formula(paste(dependent_var, "~", predictor_var)),
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -125.50 -85.52 -42.69 29.98 1079.13
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.262e+02 7.281e+00 17.336 <2e-16 ***
## Avg.Volume -7.687e-08 1.026e-06 -0.075 0.94
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 142.4 on 424 degrees of freedom
## Multiple R-squared: 1.323e-05, Adjusted R-squared: -0.002345
## F-statistic: 0.005612 on 1 and 424 DF, p-value: 0.9403
##
##
## $transformed_model
##
## Call:
## lm(formula = as.formula(paste(dependent_var, "~", transformed_var)),
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -142.23 -82.78 -41.33 29.30 1070.32
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 240.215 75.256 3.192 0.00152 **
## Avg.Volume_transformed -9.402 6.172 -1.523 0.12841
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 142 on 424 degrees of freedom
## Multiple R-squared: 0.005444, Adjusted R-squared: 0.003098
## F-statistic: 2.321 on 1 and 424 DF, p-value: 0.1284
##
##
## $model_without_outliers
##
## Call:
## lm(formula = as.formula(paste(dependent_var, "~", predictor_var)),
## data = data_without_outliers)
##
## Residuals:
## Min 1Q Median 3Q Max
## -119.43 -77.32 -36.00 35.77 454.30
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.205e+02 5.896e+00 20.431 <2e-16 ***
## Avg.Volume -1.676e-06 1.101e-06 -1.522 0.129
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 111.7 on 419 degrees of freedom
## Multiple R-squared: 0.005501, Adjusted R-squared: 0.003128
## F-statistic: 2.318 on 1 and 419 DF, p-value: 0.1287
results$influential_outliers # Name of the outliers
## Ticker Sector Industry Region
## 98 GWW Industrials Industrial Distribution North America
## 151 EQIX Real Estate REIT - Specialty North America
## 268 COST Consumer Defensive Discount Stores North America
## 361 TSLA Consumer Cyclical Auto Manufacturers North America
## 622 GHC Consumer Defensive Education & Training Services North America
## Market.Cap.Classification Volatility.Classification Growth.vs.Value
## 98 Large Cap Medium Volatility Growth
## 151 Large Cap Low Volatility Growth
## 268 Large Cap Low Volatility Growth
## 361 Large Cap High Volatility Growth
## 622 Mid Cap Medium Volatility Growth
## P.E.Ratio Dividend.Yield.... Beta Avg.Volume Recent.Close.Price EPS
## 98 32.68276 0.68 1.151 233209.16 1205.34 36.88
## 151 88.50135 1.74 0.701 525569.72 981.48 11.09
## 268 58.83051 0.48 0.789 1990399.60 971.88 16.52
## 361 94.30601 0.00 2.295 96770662.55 345.16 3.66
## 622 18.18239 0.74 1.114 15381.27 931.12 51.21
## Revenue Net.Income Debt.to.Equity.Ratio ROE P.B.Ratio
## 98 16931999744 1828999936 83.325 0.52611 16.759687
## 151 8401490944 1056177984 140.964 0.08267 6.969451
## 268 254453006336 7367000064 42.118 0.30267 18.231411
## 361 97150001152 12743000064 18.078 0.20389 15.828671
## 622 4711917056 227524992 32.787 0.05939 1.007689
## Free.Cash.Flow Analyst.Ratings Insider.Transactions Avg.Volume_transformed
## 98 1546249984 strongBuy Positive 11.18847
## 151 2946253568 buy Negative 11.84765
## 268 4467500032 buy Negative 12.90911
## 361 676625024 buy Negative 15.87610
## 622 213084880 sell Positive 8.91794
## A log transformation is also valid for this variable, so we will test it as well
data <- stock_data
log_transformed_var <- paste0(predictor_var, "_Log") # build the transformed column name dynamically
data[[log_transformed_var]] <- log(stock_data[[predictor_var]])
model4 <- lm(as.formula(paste(dependent_var, "~", log_transformed_var)), data = data)
par(mfrow = c(2, 2))
plot(model4, main = "Model with Log Transform")
summary(model4)
##
## Call:
## lm(formula = as.formula(paste(dependent_var, "~", log_transformed_var)),
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -141.84 -82.41 -41.31 29.25 1070.25
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 228.587 67.316 3.396 0.000749 ***
## Avg.Volume_Log -7.565 4.940 -1.531 0.126460
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 142 on 424 degrees of freedom
## Multiple R-squared: 0.005499, Adjusted R-squared: 0.003154
## F-statistic: 2.345 on 1 and 424 DF, p-value: 0.1265
p4 <- ggplot(data, aes_string(x = log_transformed_var, y = dependent_var)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", color = "blue", se = TRUE) +
theme_minimal() +
labs(title = "Log Model",
x = predictor_var,
y = dependent_var)
print(p4)
## `geom_smooth()` using formula = 'y ~ x'
This variable will likely not be used in future analysis, as none of the linear models (base, transformed, or outlier-removed) demonstrates statistical significance. This means that this variable, while passing Boruta, does not show much promise. I will continue to use the base variable when I try different models with it, but as of now it looks like I will not be using it in the future. Note that a log transform is also applicable here, so I tested it alongside the optimal power transform.
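The estimated $\lambda \approx -0.016$ for Avg.Volume is essentially zero, and the log is exactly the $\lambda \to 0$ limit of the Box-Cox family, which is why the Box-Cox and log models fit almost identically here ($R^2 = 0.005444$ vs $0.005499$):

```latex
\lim_{\lambda \to 0} \frac{x^{\lambda}-1}{\lambda} = \log x
```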
helper.analyze_variable(stock_data, "Net.Income")
##
## Summary Statistics for Net.Income :
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.302e+06 9.838e+07 3.065e+08 1.070e+09 9.530e+08 3.370e+10
data = stock_data
dependent_var = "Recent.Close.Price"
predictor_var = "Net.Income"
results <- detect_nonlinearities_and_outliers(
data = stock_data,
dependent_var = "Recent.Close.Price",
predictor_var = "Net.Income"
)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
# Output results
print("Optimal Lambda from powerTransform")
## [1] "Optimal Lambda from powerTransform"
results$lambda # Optimal lambda
## Y1
## 0.02544285
results$model_summaries # Summaries of the three models
## $base_model
##
## Call:
## lm(formula = as.formula(paste(dependent_var, "~", predictor_var)),
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -437.13 -80.70 -37.90 27.15 1069.30
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.120e+02 7.276e+00 15.392 < 2e-16 ***
## Net.Income 1.315e-08 2.638e-09 4.985 9.06e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 138.4 on 424 degrees of freedom
## Multiple R-squared: 0.05536, Adjusted R-squared: 0.05313
## F-statistic: 24.85 on 1 and 424 DF, p-value: 9.058e-07
##
##
## $transformed_model
##
## Call:
## lm(formula = as.formula(paste(dependent_var, "~", transformed_var)),
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -213.17 -65.02 -22.67 29.32 1014.16
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -421.626 57.721 -7.305 1.39e-12 ***
## Net.Income_transformed 21.639 2.267 9.545 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 129.2 on 424 degrees of freedom
## Multiple R-squared: 0.1769, Adjusted R-squared: 0.1749
## F-statistic: 91.1 on 1 and 424 DF, p-value: < 2.2e-16
##
##
## $model_without_outliers
##
## Call:
## lm(formula = as.formula(paste(dependent_var, "~", predictor_var)),
## data = data_without_outliers)
##
## Residuals:
## Min 1Q Median 3Q Max
## -280.47 -73.10 -31.51 27.10 857.44
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.015e+02 6.565e+00 15.461 < 2e-16 ***
## Net.Income 2.135e-08 3.173e-09 6.726 5.71e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 120.1 on 420 degrees of freedom
## Multiple R-squared: 0.09724, Adjusted R-squared: 0.0951
## F-statistic: 45.24 on 1 and 420 DF, p-value: 5.711e-11
results$influential_outliers # Name of the outliers
## Ticker Sector Industry Region
## 98 GWW Industrials Industrial Distribution North America
## 174 XOM Energy Oil & Gas Integrated North America
## 268 COST Consumer Defensive Discount Stores North America
## 440 WMT Consumer Defensive Discount Stores North America
## Market.Cap.Classification Volatility.Classification Growth.vs.Value
## 98 Large Cap Medium Volatility Growth
## 174 Large Cap Medium Volatility Value
## 268 Large Cap Low Volatility Growth
## 440 Large Cap Low Volatility Growth
## P.E.Ratio Dividend.Yield.... Beta Avg.Volume Recent.Close.Price EPS
## 98 32.68276 0.68 1.151 233209.2 1205.34 36.88
## 174 14.68991 3.36 0.880 16762449.8 117.96 8.03
## 268 58.83051 0.48 0.789 1990399.6 971.88 16.52
## 440 38.22314 0.99 0.516 17702963.3 92.50 2.42
## Revenue Net.Income Debt.to.Equity.Ratio ROE P.B.Ratio
## 98 1.69320e+10 1.8290e+09 83.325 0.52611 16.759687
## 174 3.43818e+11 3.3700e+10 15.394 0.14514 1.930226
## 268 2.54453e+11 7.3670e+09 42.118 0.30267 18.231411
## 440 6.65035e+11 1.5552e+10 69.566 0.18532 8.883981
## Free.Cash.Flow Analyst.Ratings Insider.Transactions Net.Income_transformed
## 98 1546249984 strongBuy Positive 28.31875
## 174 28811499520 buy Negative 33.52234
## 268 4467500032 buy Negative 30.75882
## 440 7701374976 buy Negative 32.10348
## A log transformation is also valid for this variable, so we will test it as well
data <- stock_data
log_transformed_var <- paste0(predictor_var, "_Log") # build the transformed column name dynamically
data[[log_transformed_var]] <- log(stock_data[[predictor_var]])
model4 <- lm(as.formula(paste(dependent_var, "~", log_transformed_var)), data = data)
par(mfrow = c(2, 2))
plot(model4, main = "Model with Log Transform")
summary(model4)
##
## Call:
## lm(formula = as.formula(paste(dependent_var, "~", log_transformed_var)),
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -210.61 -65.84 -23.33 28.96 1014.65
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -564.573 72.829 -7.752 6.77e-14 ***
## Net.Income_Log 35.414 3.721 9.518 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 129.3 on 424 degrees of freedom
## Multiple R-squared: 0.176, Adjusted R-squared: 0.1741
## F-statistic: 90.59 on 1 and 424 DF, p-value: < 2.2e-16
p4 <- ggplot(data, aes_string(x = log_transformed_var, y = dependent_var)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", color = "blue", se = TRUE) +
theme_minimal() +
labs(title = "Log Model",
x = predictor_var,
y = dependent_var)
print(p4)
## `geom_smooth()` using formula = 'y ~ x'
stock_data$Net.Income_log = log(stock_data$Net.Income)
For Net.Income I will use a log transform, as it gives the highest R-squared and is a transformation that I am comfortable using. Additionally, it does not cost us economic interpretability, since Net Income is large and positive for every observation, so the log is well defined. As for outliers, I do not think it is worth removing the four flagged values, as they do not produce a statistically significant difference between the R-squared of the base model and that of the outlier-removed model.

## Descriptive Analysis of Factor Variables

Please refer to section 1 for boxplots and ANOVA analysis of the chosen factor variables. Density plots do not exist for factor variables, and linearity and outliers are likewise not a concern with them.

## Correlation Analysis
factor_vars <- c("Market.Cap.Classification", "Growth.vs.Value")
selected_vars <- c(factor_vars, top_5_vars_stock)
correlation_data_set <- stock_data[selected_vars]
correlation_data_set$Market.Cap.Classification <- as.factor(correlation_data_set$Market.Cap.Classification)
correlation_data_set$Growth.vs.Value <- as.factor(correlation_data_set$Growth.vs.Value)
correlation_test <- data.frame(lapply(correlation_data_set, function(x) {
if (is.factor(x)) as.numeric(x) else x
}))
cor_matrix <- cor(correlation_test, use = "complete.obs")
strong_correlations <- which(abs(cor_matrix) > 0.8 & abs(cor_matrix) < 1, arr.ind = TRUE)
strong_pairs <- data.frame(
Var1 = rownames(cor_matrix)[strong_correlations[, 1]],
Var2 = colnames(cor_matrix)[strong_correlations[, 2]],
Correlation = cor_matrix[strong_correlations]
)
print(strong_pairs)
## [1] Var1 Var2 Correlation
## <0 rows> (or 0-length row.names)
# Plot the correlation matrix; hard to view due to the 21 variables, but still somewhat useful
corrplot(cor_matrix, method = "color", type = "upper", order = "hclust",
tl.col = "black", tl.srt = 45)
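The screening above thresholds the pairwise Pearson correlation at $|r| > 0.8$, where for variables $x$ and $y$:

```latex
r_{xy} = \frac{\sum_{i}(x_i-\bar{x})(y_i-\bar{y})}
{\sqrt{\sum_{i}(x_i-\bar{x})^2}\,\sqrt{\sum_{i}(y_i-\bar{y})^2}}
```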
There appear to be no variable pairs with high correlation, which means we can proceed with a linear model on all of our variables without worrying about multicollinearity. I prefer using a correlation matrix over VIF because I am more comfortable with the math behind it and have used it more often; I will still use VIF in step 3.

# 3

## Initial Model Building
# build the model formula once so it can be reused when fitting models
formula <- as.formula(paste("Recent.Close.Price ~", paste(selected_vars, collapse = " + ")))
# subset to the response plus only the chosen predictors
model_data <- stock_data[, c("Recent.Close.Price", selected_vars)]
model_data$Net.Income_log <- log(model_data$Net.Income)
# convert to factors so lm() applies one-hot (dummy) encoding automatically
model_data$Market.Cap.Classification <- as.factor(model_data$Market.Cap.Classification)
model_data$Growth.vs.Value <- as.factor(model_data$Growth.vs.Value)
initial_model <- lm(formula, data = model_data)
par(mfrow = c(2, 2)) # Reset plotting area
plot(initial_model, main = "Initial Model")
summary(initial_model)
##
## Call:
## lm(formula = formula, data = model_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -384.95 -32.71 -12.58 23.86 703.95
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.813e+01 9.277e+00 6.266 9.22e-10 ***
## Market.Cap.ClassificationMid Cap -3.263e+01 9.585e+00 -3.404 0.000728 ***
## Market.Cap.ClassificationSmall Cap -4.458e+01 1.217e+01 -3.662 0.000282 ***
## Growth.vs.ValueValue -9.309e+01 1.066e+01 -8.730 < 2e-16 ***
## EPS 1.752e+01 7.130e-01 24.569 < 2e-16 ***
## P.E.Ratio 1.294e-01 5.176e-02 2.500 0.012801 *
## P.B.Ratio 1.854e+00 3.958e-01 4.684 3.81e-06 ***
## Avg.Volume 2.703e-09 6.574e-07 0.004 0.996721
## Net.Income 7.002e-10 1.833e-09 0.382 0.702633
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 80.44 on 417 degrees of freedom
## Multiple R-squared: 0.6862, Adjusted R-squared: 0.6802
## F-statistic: 114 on 8 and 417 DF, p-value: < 2.2e-16
From the initial model we can eliminate Avg.Volume and Net.Income, as their p-values are above 0.05. We do, however, have a pretty good R^2 value for stock price prediction. We will still test the model with Net.Income_log later.

### VIF testing
vif_values <- vif(initial_model)
print(vif_values)
## GVIF Df GVIF^(1/(2*Df))
## Market.Cap.Classification 1.386469 2 1.085119
## Growth.vs.Value 1.195510 1 1.093394
## EPS 1.298475 1 1.139507
## P.E.Ratio 1.095301 1 1.046566
## P.B.Ratio 1.097711 1 1.047717
## Avg.Volume 1.286456 1 1.134220
## Net.Income 1.429539 1 1.195633
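For a single-coefficient term, the value reported by car::vif is based on regressing predictor $j$ on all the other predictors; for multi-level factors such as Market.Cap.Classification it reports the generalized $\mathrm{GVIF}^{1/(2\,\mathrm{Df})}$ so that terms with different degrees of freedom are comparable:

```latex
\mathrm{VIF}_j = \frac{1}{1 - R_j^2}
```

All values here are close to 1, far below the common rule-of-thumb cutoffs of 5 or 10.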
Nothing shows high multicollinearity, which matches the results from the correlation matrix.

## Testing the initial model with Net.Income_log and without Avg.Volume
quantitative_vars <- c("EPS", "P.E.Ratio", "P.B.Ratio", "Net.Income_log")
selected_vars <- c(factor_vars, quantitative_vars)
formula <- as.formula(paste("Recent.Close.Price ~", paste(selected_vars, collapse = " + ")))
model_2 <- lm(formula, data = model_data)
par(mfrow = c(2, 2)) # Reset plotting area
plot(model_2, main = "Model 2")
summary(model_2)
##
## Call:
## lm(formula = formula, data = model_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -383.87 -33.14 -12.11 23.63 702.81
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.87635 100.23757 0.059 0.9533
## Market.Cap.ClassificationMid Cap -29.36052 12.37351 -2.373 0.0181 *
## Market.Cap.ClassificationSmall Cap -37.05171 20.28498 -1.827 0.0685 .
## Growth.vs.ValueValue -94.27068 11.05572 -8.527 2.78e-16 ***
## EPS 17.45541 0.71415 24.442 < 2e-16 ***
## P.E.Ratio 0.14283 0.05757 2.481 0.0135 *
## P.B.Ratio 1.84279 0.39636 4.649 4.47e-06 ***
## Net.Income_log 2.58590 4.83308 0.535 0.5929
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 80.34 on 418 degrees of freedom
## Multiple R-squared: 0.6863, Adjusted R-squared: 0.6811
## F-statistic: 130.7 on 7 and 418 DF, p-value: < 2.2e-16
After testing this model we can see that even Net.Income_log is still not statistically significant, and its inclusion does not change the R^2 value meaningfully, so I will be excluding both Net.Income_log and Net.Income from future models.

## Testing Model 3 with only 3 quantitative variables
model_data$Avg.Volume <- NULL
model_data$Net.Income <- NULL
model_data$Net.Income_log <- NULL
quantitative_vars <- c("EPS", "P.E.Ratio", "P.B.Ratio")
selected_vars <- c(factor_vars, quantitative_vars)
formula <- as.formula(paste("Recent.Close.Price ~", paste(selected_vars, collapse = " + ")))
model_3 <- lm(formula, data = model_data)
par(mfrow = c(2, 2)) # Reset plotting area
plot(model_3, main = "Model 3")
summary(model_3)
##
## Call:
## lm(formula = formula, data = model_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -385.55 -31.80 -12.47 23.44 703.15
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 59.30879 8.61666 6.883 2.15e-11 ***
## Market.Cap.ClassificationMid Cap -33.90583 8.98870 -3.772 0.000185 ***
## Market.Cap.ClassificationSmall Cap -45.96666 11.55963 -3.976 8.24e-05 ***
## Growth.vs.ValueValue -92.49990 10.53972 -8.776 < 2e-16 ***
## EPS 17.55041 0.69114 25.394 < 2e-16 ***
## P.E.Ratio 0.12871 0.05112 2.518 0.012178 *
## P.B.Ratio 1.86534 0.39378 4.737 2.97e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 80.27 on 419 degrees of freedom
## Multiple R-squared: 0.6861, Adjusted R-squared: 0.6816
## F-statistic: 152.6 on 6 and 419 DF, p-value: < 2.2e-16
So far every one of these variables is statistically significant, so we will proceed with this model for further analysis. Next we will check it for misspecification with the Ramsey RESET test, having omitted the two other variables from further consideration and testing.
## Testing Model Misspecification
resettest(model_3, power = 2, type = "regressor")
##
## RESET test
##
## data: model_3
## RESET = 16.289, df1 = 3, df2 = 416, p-value = 4.998e-10
resettest(model_3, power = 3, type = "regressor")
##
## RESET test
##
## data: model_3
## RESET = 8.3554, df1 = 3, df2 = 416, p-value = 2.092e-05
Both RESET tests reject the linear specification (p < 0.001), so we need a non-linear transformation for at least one of the quantitative variables; we will try that on the next slide.
### Testing each of the selected variables to see if a transformation is needed
for (var in quantitative_vars) {
# Define formulas
formula_test_square <- as.formula(paste("Recent.Close.Price ~",
paste(selected_vars, collapse = " + "),
"+ I(", var, "^2)"))
formula_test_both <- as.formula(paste("Recent.Close.Price ~",
paste(selected_vars, collapse = " + "),
"+ I(", var, "^2) + I(", var, "^3)"))
formula_test_cube <- as.formula(paste("Recent.Close.Price ~",
paste(selected_vars, collapse = " + "),
"+ I(", var, "^3)"))
formula_test_no_linear <- as.formula(paste("Recent.Close.Price ~",
paste(selected_vars, collapse = " + "),
"+ I(", var, "^2) + I(", var, "^3) -", var))
# Fit models
model_test_square <- lm(formula_test_square, data = model_data)
model_test_both <- lm(formula_test_both, data = model_data)
model_test_cube <- lm(formula_test_cube, data = model_data)
model_test_no_linear <- lm(formula_test_no_linear, data = model_data)
# Print BIC values
print(paste("Testing non-linearity for variable:", var))
print(BIC(model_3, model_test_square, model_test_cube, model_test_both, model_test_no_linear))
# Extract R-squared and Adjusted R-squared values
r_squared_values <- c(
summary(model_test_square)$r.squared,
summary(model_test_cube)$r.squared,
summary(model_test_both)$r.squared,
summary(model_test_no_linear)$r.squared
)
adj_r_squared_values <- c(
summary(model_test_square)$adj.r.squared,
summary(model_test_cube)$adj.r.squared,
summary(model_test_both)$adj.r.squared,
summary(model_test_no_linear)$adj.r.squared
)
# Print R-squared and Adjusted R-squared values
print(paste("R-squared values for models with", var, ":"))
print(r_squared_values)
print(paste("Adjusted R-squared values for models with", var, ":"))
print(adj_r_squared_values)
}
## [1] "Testing non-linearity for variable: EPS"
## df BIC
## model_3 8 4986.635
## model_test_square 9 4979.291
## model_test_cube 9 4986.125
## model_test_both 10 4966.657
## model_test_no_linear 9 5080.614
## [1] "R-squared values for models with EPS :"
## [1] 0.6958361 0.6909166 0.7088909 0.6141627
## [1] "Adjusted R-squared values for models with EPS :"
## [1] 0.6907424 0.6857406 0.7033061 0.6077013
## [1] "Testing non-linearity for variable: P.E.Ratio"
## df BIC
## model_3 8 4986.635
## model_test_square 9 4981.593
## model_test_cube 9 4986.910
## model_test_both 10 4963.063
## model_test_no_linear 9 4997.502
## [1] "R-squared values for models with P.E.Ratio :"
## [1] 0.6941874 0.6903470 0.7113365 0.6825509
## [1] "Adjusted R-squared values for models with P.E.Ratio :"
## [1] 0.6890662 0.6851614 0.7057986 0.6772347
## [1] "Testing non-linearity for variable: P.B.Ratio"
## df BIC
## model_3 8 4986.635
## model_test_square 9 4971.871
## model_test_cube 9 4980.229
## model_test_both 10 4959.054
## model_test_no_linear 9 5003.738
## [1] "R-squared values for models with P.B.Ratio :"
## [1] 0.7010876 0.6951656 0.7140402 0.6778703
## [1] "Adjusted R-squared values for models with P.B.Ratio :"
## [1] 0.6960819 0.6900607 0.7085542 0.6724757
Testing square and cube powers of the quantitative variables shows that, for every variable, we would prefer a model that includes linear, quadratic, and cubic terms. We now have BIC and adjusted R^2 values for each individual model, so the next step is to combine the transformations and find the best overall specification. So far the best model by BIC is model_test_both with only P.B.Ratio transformed, at a BIC of 4959.054 and an adjusted R^2 of 0.7085542; all further models will be tested against it. We are using BIC instead of AIC for now because we want to select the best possible model for our current data rather than optimize for forecasting; we will use AIC at the end to confirm our selections.
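The BIC values driving this comparison can be reproduced by hand from the residual sum of squares. A minimal sketch, using the built-in `cars` dataset as a stand-in for `model_data`:

```r
# BIC for a Gaussian lm fit, matching stats::BIC:
#   BIC = -2*logLik + k*log(n),
# where logLik = -(n/2) * (log(2*pi) + log(RSS/n) + 1) and k counts
# all estimated parameters, including the residual variance.
manual_bic <- function(fit) {
  n   <- length(residuals(fit))
  rss <- sum(residuals(fit)^2)
  k   <- length(coef(fit)) + 1            # +1 for sigma^2
  n * (log(2 * pi) + log(rss / n) + 1) + k * log(n)
}

fit <- lm(dist ~ speed, data = cars)
manual_bic(fit)  # agrees with BIC(fit)
```

The `k * log(n)` term is why BIC punishes extra polynomial terms more harshly than AIC does as n grows.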
### Testing possible combinations to find the best model through BIC and adjusted R^2
# Function to fit models and compare adjusted R-squared values
compare_models <- function(data, dependent_var, pairs) {
results <- list()
for (pair in pairs) {
var1 <- pair[1]
var2 <- pair[2]
# Define formulas
formula_test <- as.formula(paste(dependent_var, "~",
paste(selected_vars, collapse = " + "),
"+ I(", var1, "^2) + I(", var1, "^3) + I(", var2, "^2) + I(", var2, "^3)"))
# fit our model
model_test <- lm(formula_test, data = data)
adj_r_squared <- summary(model_test)$adj.r.squared
bic_result <- BIC(model_test)
# store results
results[[paste(var1, var2, sep = "_")]] <- list(adj_r_squared = adj_r_squared, BIC = bic_result)
}
# now doing for all three
var1 <- quantitative_vars[1]
var2 <- quantitative_vars[2]
var3 <- quantitative_vars[3]
formula_test <- as.formula(paste("Recent.Close.Price ~",
paste(selected_vars, collapse = " + "),
"+ I(", var1, "^2) + I(", var1, "^3) + I(", var2, "^2) + I(", var2, "^3) + I(", var3, "^2) + I(", var3,
"^3)"))
model_test <- lm(formula_test, data = data)
adj_r_squared <- summary(model_test)$adj.r.squared
bic_result <- BIC(model_test)
# store results
results[[paste(var1, var2, var3, sep = "_")]] <- list(adj_r_squared = adj_r_squared, BIC = bic_result)
return(results)
}
dependent_var <- "Recent.Close.Price"
# combn generates all unordered pairs for free (combinations, not permutations)
pairs <- combn(quantitative_vars, 2, simplify = FALSE)
# run the function
results <- compare_models(model_data, dependent_var, pairs)
# Print results
print("Adjusted R-squared and BIC values for all pairs:")
## [1] "Adjusted R-squared and BIC values for all pairs:"
print(results)
## $EPS_P.E.Ratio
## $EPS_P.E.Ratio$adj_r_squared
## [1] 0.7427126
##
## $EPS_P.E.Ratio$BIC
## [1] 4916.01
##
##
## $EPS_P.B.Ratio
## $EPS_P.B.Ratio$adj_r_squared
## [1] 0.7276459
##
## $EPS_P.B.Ratio$BIC
## [1] 4940.253
##
##
## $P.E.Ratio_P.B.Ratio
## $P.E.Ratio_P.B.Ratio$adj_r_squared
## [1] 0.7276538
##
## $P.E.Ratio_P.B.Ratio$BIC
## [1] 4940.241
##
##
## $EPS_P.E.Ratio_P.B.Ratio
## $EPS_P.E.Ratio_P.B.Ratio$adj_r_squared
## [1] 0.7597781
##
## $EPS_P.E.Ratio_P.B.Ratio$BIC
## [1] 4896.824
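The adjusted R^2 figures above can be sanity-checked by hand from the standard penalty formula. A small sketch (here n = 426 observations and p counts predictors excluding the intercept):

```r
# Adjusted R^2 penalizes every extra regressor:
#   adj R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Full three-variable polynomial model: R^2 = 0.7666, 12 coefficients
# besides the intercept, n = 426 -> roughly 0.7598.
adj_r2(0.7666, n = 426, p = 12)
```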
After testing all of these combinations, the model with squared and cubed terms for all three variables has the highest adjusted R^2 and the lowest BIC, so we will proceed with it for now. One worry is that I am artificially inflating R^2 by adding more variables, but since I am using adjusted R^2 and BIC, both of which penalize extra parameters, I think it is fine.
### Summary of Model after testing for model misspecification
poly_terms <- c()
for (var in quantitative_vars) {
poly_terms <- c(poly_terms, paste0("I(", var, "^2)"), paste0("I(", var, "^3)"))
}
selected_vars <- c(quantitative_vars, factor_vars, poly_terms)
formula_test <- as.formula(paste("Recent.Close.Price ~", paste(selected_vars, collapse = " + ")))
model_4 <- lm(formula_test, data = model_data)
par(mfrow = c(2, 2)) # 2x2 grid for the diagnostic plots
plot(model_4, main = "Model 4")
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
print(BIC(model_4))
## [1] 4896.824
summary(model_4)
##
## Call:
## lm(formula = formula_test, data = model_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -265.26 -27.65 -3.01 18.83 600.15
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.902e+01 1.401e+01 -5.639 3.18e-08 ***
## EPS 3.739e+01 2.686e+00 13.920 < 2e-16 ***
## P.E.Ratio 1.738e+00 2.115e-01 8.218 2.70e-15 ***
## P.B.Ratio 8.289e+00 1.325e+00 6.256 9.86e-10 ***
## Market.Cap.ClassificationMid Cap -1.703e+01 8.067e+00 -2.111 0.035376 *
## Market.Cap.ClassificationSmall Cap 3.121e+00 1.112e+01 0.281 0.779173
## Growth.vs.ValueValue -6.494e+01 1.009e+01 -6.433 3.47e-10 ***
## I(EPS^2) -1.170e+00 1.698e-01 -6.893 2.05e-11 ***
## I(EPS^3) 1.598e-02 2.677e-03 5.969 5.13e-09 ***
## I(P.E.Ratio^2) -4.946e-03 7.458e-04 -6.631 1.04e-10 ***
## I(P.E.Ratio^3) 3.388e-06 5.788e-07 5.853 9.84e-09 ***
## I(P.B.Ratio^2) -1.563e-01 3.375e-02 -4.632 4.86e-06 ***
## I(P.B.Ratio^3) 7.375e-04 1.917e-04 3.847 0.000139 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 69.72 on 413 degrees of freedom
## Multiple R-squared: 0.7666, Adjusted R-squared: 0.7598
## F-statistic: 113 on 12 and 413 DF, p-value: < 2.2e-16
This appears to be our best model so far: although it reduces the importance of the market cap classification, it has the highest adjusted R^2 (0.7597781) of any model yet, with a BIC of 4896.824. We will proceed with it as our fourth candidate model and test whether it needs interaction terms.
## Testing for Need of interaction terms
# Function to test if interaction terms are needed
test_interaction_terms <- function(data, dependent_var, selected_vars) {
results <- data.frame(
model = character(),
var1 = character(),
var2 = character(),
adj_r_squared = numeric(),
AIC = numeric(),
BIC = numeric(),
stringsAsFactors = FALSE
)
# Define base model formula
base_formula <- as.formula(paste(dependent_var, "~", paste(selected_vars, collapse = " + ")))
# Fit base model
base_model <- lm(base_formula, data = data)
# Extract adjusted R-squared, AIC, and BIC for base model
base_adj_r_squared <- summary(base_model)$adj.r.squared
base_aic <- AIC(base_model)
base_bic <- BIC(base_model)
# Store base model results
results <- rbind(results, data.frame(
model = "base_model",
var1 = NA,
var2 = NA,
adj_r_squared = base_adj_r_squared,
AIC = base_aic,
BIC = base_bic,
stringsAsFactors = FALSE
))
# Generate all possible pairs of selected variables for interaction terms
pairs <- combn(selected_vars, 2, simplify = FALSE)
for (pair in pairs) {
# sorting each pair eliminates duplicate orderings (still not the fastest approach, but simple)
pair <- sort(pair)
var1 <- pair[1]
var2 <- pair[2]
# Define formula with interaction term
interaction_formula <- as.formula(paste(dependent_var, "~",
paste(selected_vars, collapse = " + "),
"+", paste(var1, "*", var2)))
# Fit model with interaction term
interaction_model <- lm(interaction_formula, data = data)
# Extract adjusted R-squared, AIC, and BIC for interaction model
interaction_adj_r_squared <- summary(interaction_model)$adj.r.squared
interaction_aic <- AIC(interaction_model)
interaction_bic <- BIC(interaction_model)
# Store interaction model results
results <- rbind(results, data.frame(
model = paste(var1, var2, sep = "_"),
var1 = var1,
var2 = var2,
adj_r_squared = interaction_adj_r_squared,
AIC = interaction_aic,
BIC = interaction_bic,
stringsAsFactors = FALSE
))
}
return(results)
}
# Example usage
dependent_var <- "Recent.Close.Price"
# Test for interaction terms
interaction_results <- test_interaction_terms(model_data, dependent_var, selected_vars)
interaction_results <- sort_by(interaction_results, interaction_results$BIC)
# Remove every interaction between EPS and P.E.Ratio: since P/E = Price / EPS, their product is just the share price, which would make the model circular.
interaction_results <- helper.remove_specific_rows(interaction_results, "EPS", "P.E.Ratio")
# Print results
print("Model comparison results with interaction terms:")
## [1] "Model comparison results with interaction terms:"
print(interaction_results)
## model var1
## 5 EPS_Growth.vs.Value EPS
## 23 I(EPS^2)_P.B.Ratio I(EPS^2)
## 24 I(EPS^3)_P.B.Ratio I(EPS^3)
## 3 EPS_P.B.Ratio EPS
## 36 Growth.vs.Value_I(EPS^2) Growth.vs.Value
## 49 I(EPS^3)_I(P.B.Ratio^2) I(EPS^3)
## 31 I(EPS^3)_Market.Cap.Classification I(EPS^3)
## 37 Growth.vs.Value_I(EPS^3) Growth.vs.Value
## 30 I(EPS^2)_Market.Cap.Classification I(EPS^2)
## 18 I(P.E.Ratio^3)_P.E.Ratio I(P.E.Ratio^3)
## 4 EPS_Market.Cap.Classification EPS
## 51 I(P.E.Ratio^2)_I(P.E.Ratio^3) I(P.E.Ratio^2)
## 45 I(EPS^2)_I(P.B.Ratio^2) I(EPS^2)
## 21 Market.Cap.Classification_P.B.Ratio Market.Cap.Classification
## 13 Market.Cap.Classification_P.E.Ratio Market.Cap.Classification
## 29 Growth.vs.Value_Market.Cap.Classification Growth.vs.Value
## 10 EPS_I(P.B.Ratio^2) EPS
## 50 I(EPS^3)_I(P.B.Ratio^3) I(EPS^3)
## 32 I(P.E.Ratio^2)_Market.Cap.Classification I(P.E.Ratio^2)
## 22 Growth.vs.Value_P.B.Ratio Growth.vs.Value
## 34 I(P.B.Ratio^2)_Market.Cap.Classification I(P.B.Ratio^2)
## 1 base_model <NA>
## 6 EPS_I(EPS^2) EPS
## 17 I(P.E.Ratio^2)_P.E.Ratio I(P.E.Ratio^2)
## 27 I(P.B.Ratio^2)_P.B.Ratio I(P.B.Ratio^2)
## 33 I(P.E.Ratio^3)_Market.Cap.Classification I(P.E.Ratio^3)
## 55 I(P.B.Ratio^3)_I(P.E.Ratio^3) I(P.B.Ratio^3)
## 54 I(P.B.Ratio^2)_I(P.E.Ratio^3) I(P.B.Ratio^2)
## 26 I(P.E.Ratio^3)_P.B.Ratio I(P.E.Ratio^3)
## 52 I(P.B.Ratio^2)_I(P.E.Ratio^2) I(P.B.Ratio^2)
## 53 I(P.B.Ratio^3)_I(P.E.Ratio^2) I(P.B.Ratio^3)
## 46 I(EPS^2)_I(P.B.Ratio^3) I(EPS^2)
## 25 I(P.E.Ratio^2)_P.B.Ratio I(P.E.Ratio^2)
## 35 I(P.B.Ratio^3)_Market.Cap.Classification I(P.B.Ratio^3)
## 40 Growth.vs.Value_I(P.B.Ratio^2) Growth.vs.Value
## 39 Growth.vs.Value_I(P.E.Ratio^3) Growth.vs.Value
## 19 I(P.B.Ratio^2)_P.E.Ratio I(P.B.Ratio^2)
## 38 Growth.vs.Value_I(P.E.Ratio^2) Growth.vs.Value
## 11 EPS_I(P.B.Ratio^3) EPS
## 42 I(EPS^2)_I(EPS^3) I(EPS^2)
## 20 I(P.B.Ratio^3)_P.E.Ratio I(P.B.Ratio^3)
## 56 I(P.B.Ratio^2)_I(P.B.Ratio^3) I(P.B.Ratio^2)
## 7 EPS_I(EPS^3) EPS
## 28 I(P.B.Ratio^3)_P.B.Ratio I(P.B.Ratio^3)
## 14 Growth.vs.Value_P.E.Ratio Growth.vs.Value
## 41 Growth.vs.Value_I(P.B.Ratio^3) Growth.vs.Value
## 12 P.B.Ratio_P.E.Ratio P.B.Ratio
## var2 adj_r_squared AIC BIC
## 5 Growth.vs.Value 0.8164550 4726.391 4787.208
## 23 P.B.Ratio 0.8105643 4739.848 4800.665
## 24 P.B.Ratio 0.8082636 4744.991 4805.808
## 3 P.B.Ratio 0.7969476 4769.419 4830.235
## 36 I(EPS^2) 0.7948698 4773.756 4834.573
## 49 I(P.B.Ratio^2) 0.7888001 4786.178 4846.995
## 31 Market.Cap.Classification 0.7830120 4798.661 4863.532
## 37 I(EPS^3) 0.7795155 4804.506 4865.322
## 30 Market.Cap.Classification 0.7819840 4800.674 4865.545
## 18 P.E.Ratio 0.7747108 4813.689 4874.506
## 4 Market.Cap.Classification 0.7765463 4811.169 4876.040
## 51 I(P.E.Ratio^3) 0.7725229 4817.806 4878.623
## 45 I(P.B.Ratio^2) 0.7716715 4819.398 4880.214
## 21 P.B.Ratio 0.7701138 4823.259 4888.130
## 13 P.E.Ratio 0.7700402 4823.395 4888.266
## 29 Market.Cap.Classification 0.7677172 4827.677 4892.548
## 10 I(P.B.Ratio^2) 0.7642560 4833.013 4893.830
## 50 I(P.B.Ratio^3) 0.7638947 4833.666 4894.482
## 32 Market.Cap.Classification 0.7661946 4830.460 4895.331
## 22 P.B.Ratio 0.7634125 4834.535 4895.351
## 34 Market.Cap.Classification 0.7660776 4830.673 4895.544
## 1 <NA> 0.7597781 4840.062 4896.824
## 6 I(EPS^2) 0.7597781 4840.062 4896.824
## 17 P.E.Ratio 0.7597781 4840.062 4896.824
## 27 P.B.Ratio 0.7597781 4840.062 4896.824
## 33 Market.Cap.Classification 0.7653682 4831.963 4896.835
## 55 I(P.E.Ratio^3) 0.7619569 4837.148 4897.964
## 54 I(P.E.Ratio^3) 0.7619561 4837.149 4897.966
## 26 P.B.Ratio 0.7616645 4837.671 4898.487
## 52 I(P.E.Ratio^2) 0.7614559 4838.043 4898.860
## 53 I(P.E.Ratio^2) 0.7613985 4838.146 4898.962
## 46 I(P.B.Ratio^3) 0.7609455 4838.954 4899.770
## 25 P.B.Ratio 0.7608145 4839.187 4900.004
## 35 Market.Cap.Classification 0.7634789 4835.380 4900.251
## 40 I(P.B.Ratio^2) 0.7605664 4839.629 4900.445
## 39 I(P.E.Ratio^3) 0.7602069 4840.268 4901.085
## 19 P.E.Ratio 0.7600302 4840.582 4901.398
## 38 I(P.E.Ratio^2) 0.7600216 4840.597 4901.414
## 11 I(P.B.Ratio^3) 0.7599919 4840.650 4901.466
## 42 I(EPS^3) 0.7598407 4840.918 4901.735
## 20 P.E.Ratio 0.7598363 4840.926 4901.742
## 56 I(P.B.Ratio^3) 0.7598001 4840.990 4901.807
## 7 I(EPS^3) 0.7597492 4841.080 4901.897
## 28 P.B.Ratio 0.7597248 4841.124 4901.940
## 14 P.E.Ratio 0.7596689 4841.223 4902.039
## 41 I(P.B.Ratio^3) 0.7596402 4841.274 4902.090
## 12 P.E.Ratio 0.7594031 4841.693 4902.510
I could extrapolate this approach to three-way interactions, or test multiple two-way interactions at once, but that would be an endless search, so I will proceed with the top-ranked model as my best model for now (model_5). I am worried about overfitting from stacking multiple two-way interactions and the extra complexity, and three-way interaction terms are something I have never seen used in practice, so I will not pursue that approach.
interaction_formula <- as.formula(paste(dependent_var, "~",
paste(selected_vars, collapse = " + "),
"+", paste("EPS", "*", "Growth.vs.Value")))
model_5 <- lm(interaction_formula, data = model_data)
par(mfrow = c(2, 2)) # 2x2 grid for the diagnostic plots
plot(model_5, main = "Model 5")
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
print(BIC(model_5))
## [1] 4787.208
summary(model_5)
##
## Call:
## lm(formula = interaction_formula, data = model_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -174.16 -20.52 1.16 17.95 579.78
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.159e+01 1.225e+01 -6.660 8.81e-11 ***
## EPS 3.361e+01 2.372e+00 14.173 < 2e-16 ***
## P.E.Ratio 1.901e+00 1.854e-01 10.253 < 2e-16 ***
## P.B.Ratio 7.069e+00 1.163e+00 6.078 2.77e-09 ***
## Market.Cap.ClassificationMid Cap -1.294e+01 7.060e+00 -1.833 0.067482 .
## Market.Cap.ClassificationSmall Cap -6.208e+00 9.757e+00 -0.636 0.524986
## Growth.vs.ValueValue 2.598e+01 1.192e+01 2.179 0.029927 *
## I(EPS^2) -5.683e-01 1.576e-01 -3.605 0.000351 ***
## I(EPS^3) 6.707e-03 2.479e-03 2.706 0.007097 **
## I(P.E.Ratio^2) -5.419e-03 6.533e-04 -8.295 1.56e-15 ***
## I(P.E.Ratio^3) 3.705e-06 5.067e-07 7.313 1.38e-12 ***
## I(P.B.Ratio^2) -1.355e-01 2.956e-02 -4.585 6.02e-06 ***
## I(P.B.Ratio^3) 6.500e-04 1.677e-04 3.875 0.000124 ***
## EPS:Growth.vs.ValueValue -1.265e+01 1.116e+00 -11.337 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 60.94 on 412 degrees of freedom
## Multiple R-squared: 0.8221, Adjusted R-squared: 0.8165
## F-statistic: 146.4 on 13 and 412 DF, p-value: < 2.2e-16
As before, this appears to be our best model: although it reduces the importance of the market cap classification, it has the highest adjusted R^2 of any model so far. We will proceed with it as our fifth candidate model and now run comparison tests across all 5 candidates (initial_model, model_2, model_3, model_4, and model_5).
## Diagnostic Plots of our 5 models
helper.diagnostic_plots(initial_model)
summary(initial_model)
##
## Call:
## lm(formula = formula, data = model_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -384.95 -32.71 -12.58 23.86 703.95
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.813e+01 9.277e+00 6.266 9.22e-10 ***
## Market.Cap.ClassificationMid Cap -3.263e+01 9.585e+00 -3.404 0.000728 ***
## Market.Cap.ClassificationSmall Cap -4.458e+01 1.217e+01 -3.662 0.000282 ***
## Growth.vs.ValueValue -9.309e+01 1.066e+01 -8.730 < 2e-16 ***
## EPS 1.752e+01 7.130e-01 24.569 < 2e-16 ***
## P.E.Ratio 1.294e-01 5.176e-02 2.500 0.012801 *
## P.B.Ratio 1.854e+00 3.958e-01 4.684 3.81e-06 ***
## Avg.Volume 2.703e-09 6.574e-07 0.004 0.996721
## Net.Income 7.002e-10 1.833e-09 0.382 0.702633
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 80.44 on 417 degrees of freedom
## Multiple R-squared: 0.6862, Adjusted R-squared: 0.6802
## F-statistic: 114 on 8 and 417 DF, p-value: < 2.2e-16
helper.diagnostic_plots(model_2)
summary(model_2)
##
## Call:
## lm(formula = formula, data = model_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -383.87 -33.14 -12.11 23.63 702.81
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.87635 100.23757 0.059 0.9533
## Market.Cap.ClassificationMid Cap -29.36052 12.37351 -2.373 0.0181 *
## Market.Cap.ClassificationSmall Cap -37.05171 20.28498 -1.827 0.0685 .
## Growth.vs.ValueValue -94.27068 11.05572 -8.527 2.78e-16 ***
## EPS 17.45541 0.71415 24.442 < 2e-16 ***
## P.E.Ratio 0.14283 0.05757 2.481 0.0135 *
## P.B.Ratio 1.84279 0.39636 4.649 4.47e-06 ***
## Net.Income_log 2.58590 4.83308 0.535 0.5929
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 80.34 on 418 degrees of freedom
## Multiple R-squared: 0.6863, Adjusted R-squared: 0.6811
## F-statistic: 130.7 on 7 and 418 DF, p-value: < 2.2e-16
helper.diagnostic_plots(model_3)
summary(model_3)
##
## Call:
## lm(formula = formula, data = model_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -385.55 -31.80 -12.47 23.44 703.15
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 59.30879 8.61666 6.883 2.15e-11 ***
## Market.Cap.ClassificationMid Cap -33.90583 8.98870 -3.772 0.000185 ***
## Market.Cap.ClassificationSmall Cap -45.96666 11.55963 -3.976 8.24e-05 ***
## Growth.vs.ValueValue -92.49990 10.53972 -8.776 < 2e-16 ***
## EPS 17.55041 0.69114 25.394 < 2e-16 ***
## P.E.Ratio 0.12871 0.05112 2.518 0.012178 *
## P.B.Ratio 1.86534 0.39378 4.737 2.97e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 80.27 on 419 degrees of freedom
## Multiple R-squared: 0.6861, Adjusted R-squared: 0.6816
## F-statistic: 152.6 on 6 and 419 DF, p-value: < 2.2e-16
helper.diagnostic_plots(model_4)
summary(model_4)
##
## Call:
## lm(formula = formula_test, data = model_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -265.26 -27.65 -3.01 18.83 600.15
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.902e+01 1.401e+01 -5.639 3.18e-08 ***
## EPS 3.739e+01 2.686e+00 13.920 < 2e-16 ***
## P.E.Ratio 1.738e+00 2.115e-01 8.218 2.70e-15 ***
## P.B.Ratio 8.289e+00 1.325e+00 6.256 9.86e-10 ***
## Market.Cap.ClassificationMid Cap -1.703e+01 8.067e+00 -2.111 0.035376 *
## Market.Cap.ClassificationSmall Cap 3.121e+00 1.112e+01 0.281 0.779173
## Growth.vs.ValueValue -6.494e+01 1.009e+01 -6.433 3.47e-10 ***
## I(EPS^2) -1.170e+00 1.698e-01 -6.893 2.05e-11 ***
## I(EPS^3) 1.598e-02 2.677e-03 5.969 5.13e-09 ***
## I(P.E.Ratio^2) -4.946e-03 7.458e-04 -6.631 1.04e-10 ***
## I(P.E.Ratio^3) 3.388e-06 5.788e-07 5.853 9.84e-09 ***
## I(P.B.Ratio^2) -1.563e-01 3.375e-02 -4.632 4.86e-06 ***
## I(P.B.Ratio^3) 7.375e-04 1.917e-04 3.847 0.000139 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 69.72 on 413 degrees of freedom
## Multiple R-squared: 0.7666, Adjusted R-squared: 0.7598
## F-statistic: 113 on 12 and 413 DF, p-value: < 2.2e-16
helper.diagnostic_plots(model_5)
summary(model_5)
##
## Call:
## lm(formula = interaction_formula, data = model_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -174.16 -20.52 1.16 17.95 579.78
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.159e+01 1.225e+01 -6.660 8.81e-11 ***
## EPS 3.361e+01 2.372e+00 14.173 < 2e-16 ***
## P.E.Ratio 1.901e+00 1.854e-01 10.253 < 2e-16 ***
## P.B.Ratio 7.069e+00 1.163e+00 6.078 2.77e-09 ***
## Market.Cap.ClassificationMid Cap -1.294e+01 7.060e+00 -1.833 0.067482 .
## Market.Cap.ClassificationSmall Cap -6.208e+00 9.757e+00 -0.636 0.524986
## Growth.vs.ValueValue 2.598e+01 1.192e+01 2.179 0.029927 *
## I(EPS^2) -5.683e-01 1.576e-01 -3.605 0.000351 ***
## I(EPS^3) 6.707e-03 2.479e-03 2.706 0.007097 **
## I(P.E.Ratio^2) -5.419e-03 6.533e-04 -8.295 1.56e-15 ***
## I(P.E.Ratio^3) 3.705e-06 5.067e-07 7.313 1.38e-12 ***
## I(P.B.Ratio^2) -1.355e-01 2.956e-02 -4.585 6.02e-06 ***
## I(P.B.Ratio^3) 6.500e-04 1.677e-04 3.875 0.000124 ***
## EPS:Growth.vs.ValueValue -1.265e+01 1.116e+00 -11.337 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 60.94 on 412 degrees of freedom
## Multiple R-squared: 0.8221, Adjusted R-squared: 0.8165
## F-statistic: 146.4 on 13 and 412 DF, p-value: < 2.2e-16
# Compare AIC and BIC across models
compare_models <- function(models) {
results <- data.frame(
model = names(models),
AIC = sapply(models, AIC),
BIC = sapply(models, BIC)
)
return(results[order(results$AIC), ])
}
# Example usage
models <- list(model_1 = initial_model, model_2 = model_2, model_3 = model_3, model_4 = model_4, model_5 = model_5)
compare_models(models)
## model AIC BIC
## model_5 model_5 4726.391 4787.208
## model_4 model_4 4840.062 4896.824
## model_3 model_3 4954.199 4986.635
## model_2 model_2 4955.908 4992.397
## model_1 model_1 4958.026 4998.571
cross_validate_model <- function(model, data, model_name, k = 10) {
control <- trainControl(method = "cv", number = k)
cv_model <- train(formula(model), data = data, method = "lm", trControl = control)
results <- cv_model$results
results$model <- model_name
return(results)
}
# Collect the results into one data frame instead of printing each model separately
cv_results_list <- list()
cv_results_list[[1]] <- cross_validate_model(model_5, model_data, "Model 5")
cv_results_list[[2]] <- cross_validate_model(model_4, model_data, "Model 4")
cv_results_list[[3]] <- cross_validate_model(model_3, model_data, "Model 3")
# model_2 and the initial model use columns that were dropped from model_data, so they are refit on stock_data
cv_results_list[[4]] <- cross_validate_model(model_2, stock_data, "Model 2")
cv_results_list[[5]] <- cross_validate_model(initial_model, stock_data, "Initial Model")
cv_results_df <- do.call(rbind, cv_results_list)
print(cv_results_df)
## intercept RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 TRUE 71.20719 0.7751845 38.97920 48.52094 0.1937944 14.705386
## 2 TRUE 106.23586 0.6957015 48.33682 86.00764 0.1748311 16.597494
## 3 TRUE 79.50272 0.6970157 47.85845 22.89017 0.1436857 5.013735
## 4 TRUE 78.67322 0.7093004 48.23251 29.31212 0.1775683 11.923131
## 5 TRUE 78.81967 0.6738216 48.38239 28.70402 0.1688269 10.694216
## model
## 1 Model 5
## 2 Model 4
## 3 Model 3
## 4 Model 2
## 5 Initial Model
I will be choosing model_5 and proceeding with it for all further tests and bootstrapping. This was expected, but I wanted to confirm that it is the best model: it has the lowest AIC and BIC while also having the lowest cross-validated MAE. Its RMSE varies considerably across folds (RMSESD of about 48.5), but I am willing to overlook this given its performance on R^2, AIC, and BIC. It does raise questions about overfitting, however, since the cross-validation results are non-deterministic.
## Testing/Bootstrapping of Model 5
### Multicollinearity Testing
vif_values <- vif(model_5, type = "predictor")
## GVIFs computed for predictors
print(vif_values)
## GVIF Df GVIF^(1/(2*Df))
## EPS 2.349184 5 1.089160
## P.E.Ratio 1.635818 3 1.085482
## P.B.Ratio 1.351535 3 1.051489
## Market.Cap.Classification 1.527383 2 1.111698
## Growth.vs.Value 2.349184 5 1.089160
## Interacts With
## EPS I(EPS^2), I(EPS^3), Growth.vs.Value
## P.E.Ratio I(P.E.Ratio^2), I(P.E.Ratio^3)
## P.B.Ratio I(P.B.Ratio^2), I(P.B.Ratio^3)
## Market.Cap.Classification --
## Growth.vs.Value EPS
## Other Predictors
## EPS P.E.Ratio, P.B.Ratio, Market.Cap.Classification
## P.E.Ratio EPS, P.B.Ratio, Market.Cap.Classification, Growth.vs.Value
## P.B.Ratio EPS, P.E.Ratio, Market.Cap.Classification, Growth.vs.Value
## Market.Cap.Classification EPS, P.E.Ratio, P.B.Ratio, Growth.vs.Value
## Growth.vs.Value P.E.Ratio, P.B.Ratio, Market.Cap.Classification
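For reference, these GVIF values come from regressing each predictor (grouped with its own polynomial terms) on the others; the plain single-variable version of the computation can be sketched by hand, here with the built-in `mtcars` dataset standing in for `model_data`:

```r
# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
# predictor j on all the other predictors. Values above ~5 would
# flag problematic multicollinearity.
vif_by_hand <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    aux <- lm(reformulate(others, response = p), data = data)
    1 / (1 - summary(aux)$r.squared)
  })
}

vif_by_hand(mtcars, c("hp", "wt", "qsec"))
```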
The VIF test shows no problematic multicollinearity among our predictor variables. The polynomial and interaction terms are of course correlated with their base variables, which is why we use type = "predictor": it groups each predictor with its own higher-order terms. None of the GVIF values is greater than 5, so multicollinearity is not a concern.
### Marginal Effects Estimated
coefs <- coef(model_5)
marginal_effects <- coefs[-1] # negative indexing drops elements: every coefficient except the intercept
barplot(
marginal_effects,
main = "Marginal Effects of Predictors",
ylab = "Marginal Effects",
las = 2,
col = "skyblue",
horiz = FALSE
)
effects_model_5 <- allEffects(model_5)
filtered_effects <- effects_model_5[!grepl("\\^", names(effects_model_5))]
# plot each remaining effect; single-bracket indexing keeps the effect's name for the plot title
for (i in seq_along(filtered_effects)) {
effect <- filtered_effects[i]
plot(effect)
}
print(data.frame(Variable = names(marginal_effects), MarginalEffect = marginal_effects))
## Variable
## EPS EPS
## P.E.Ratio P.E.Ratio
## P.B.Ratio P.B.Ratio
## Market.Cap.ClassificationMid Cap Market.Cap.ClassificationMid Cap
## Market.Cap.ClassificationSmall Cap Market.Cap.ClassificationSmall Cap
## Growth.vs.ValueValue Growth.vs.ValueValue
## I(EPS^2) I(EPS^2)
## I(EPS^3) I(EPS^3)
## I(P.E.Ratio^2) I(P.E.Ratio^2)
## I(P.E.Ratio^3) I(P.E.Ratio^3)
## I(P.B.Ratio^2) I(P.B.Ratio^2)
## I(P.B.Ratio^3) I(P.B.Ratio^3)
## EPS:Growth.vs.ValueValue EPS:Growth.vs.ValueValue
## MarginalEffect
## EPS 3.361210e+01
## P.E.Ratio 1.901050e+00
## P.B.Ratio 7.069327e+00
## Market.Cap.ClassificationMid Cap -1.294353e+01
## Market.Cap.ClassificationSmall Cap -6.207608e+00
## Growth.vs.ValueValue 2.597568e+01
## I(EPS^2) -5.682798e-01
## I(EPS^3) 6.706735e-03
## I(P.E.Ratio^2) -5.419157e-03
## I(P.E.Ratio^3) 3.705277e-06
## I(P.B.Ratio^2) -1.355201e-01
## I(P.B.Ratio^3) 6.499942e-04
## EPS:Growth.vs.ValueValue -1.265159e+01
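With quadratic and cubic terms in the model, the marginal effect of a variable is the derivative of its fitted polynomial, not a single coefficient. A sketch using the EPS coefficients reported in the model_5 summary above (values rounded):

```r
# d(price)/d(EPS) = b1 + 2*b2*EPS + 3*b3*EPS^2 for Growth stocks;
# Value stocks additionally pick up the EPS:Growth.vs.Value slope shift.
b1    <- 33.61     # EPS
b2    <- -0.5683   # I(EPS^2)
b3    <- 0.006707  # I(EPS^3)
b_int <- -12.65    # EPS:Growth.vs.ValueValue

eps_slope <- function(eps, value_stock = FALSE) {
  b1 + 2 * b2 * eps + 3 * b3 * eps^2 + if (value_stock) b_int else 0
}

eps_slope(5)                      # slope near EPS = 5, Growth stock
eps_slope(5, value_stock = TRUE)  # flatter slope for Value stocks
```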
As we can see from the effects::allEffects marginal-effect plots, all of the effects are close to linear except P.B.Ratio and P.E.Ratio, which show smooth, non-linear relationships with Recent.Close.Price: their effect increases and decreases at a non-constant rate, so purely linear terms would not capture the relationship. Because the model already includes quadratic and cubic terms for these variables, I am comfortable proceeding with it rather than excluding them. Additionally, we already tested models without the linear versions of these terms and they had worse (higher) BIC scores, which supports keeping the current specification.
### Bootstrapping
n_boot <- 1000 # Number of bootstrap samples
# we will create vectors to store all of the values that we need
alpha_b <- numeric(n_boot) # Intercept
beta_b <- numeric(n_boot) # First slope coefficient (EPS)
d_alpha_b <- numeric(n_boot) # Confidence interval width for intercept
d_beta_b <- numeric(n_boot) # Confidence interval width for slope
bootstrap_r2 <- numeric(n_boot) # Storage for R² values
# Perform bootstrapping
set.seed(123) # For reproducibility
for (i in 1:n_boot) {
# Resample data with replacement
boot_indices <- sample(1:nrow(stock_data), nrow(stock_data), replace = TRUE)
boot_sample <- stock_data[boot_indices, ]
# Fit the OLS model on the bootstrap sample
reg_result <- lm(interaction_formula, data = boot_sample)
# Store the intercept and the first slope coefficient (EPS)
alpha_b[i] <- reg_result$coefficients[1]
beta_b[i] <- reg_result$coefficients[2]
# Calculate and store confidence intervals
ci <- confint(reg_result, level = 0.95)
d_alpha_b[i] <- ci[1, 2] - ci[1, 1] # Width of confidence interval for intercept
d_beta_b[i] <- ci[2, 2] - ci[2, 1] # Width of confidence interval for slope
# Store R²
bootstrap_r2[i] <- summary(reg_result)$r.squared
}
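The histograms below visualize these distributions; the same vectors can also be summarized numerically with percentile intervals. A small sketch (assumes the `alpha_b`, `beta_b`, and `bootstrap_r2` vectors produced by the loop above):

```r
# Summarize a bootstrap distribution: mean, bootstrap standard error,
# and a 95% percentile confidence interval
boot_summary <- function(x) {
  c(mean = mean(x), se = sd(x), quantile(x, c(0.025, 0.975)))
}

boot_summary(alpha_b)       # intercept
boot_summary(beta_b)        # first slope coefficient (EPS)
boot_summary(bootstrap_r2)  # R-squared
```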
# Set up plotting area
par(mfrow = c(2, 3)) # 2 rows, 3 columns for the plots
# Plot distribution of the intercept
hist(alpha_b, main = "Bootstrap Intercept", xlab = "Intercept", col = "lightblue", breaks = 30)
# Plot distribution of the slope
hist(beta_b, main = "Bootstrap Slope", xlab = "Slope", col = "lightgreen", breaks = 30)
# Plot distribution of confidence interval widths for intercept
hist(d_alpha_b, main = "CI Width for Intercept", xlab = "CI Width", col = "lightpink", breaks = 30)
# Plot distribution of confidence interval widths for slope
hist(d_beta_b, main = "CI Width for Slope", xlab = "CI Width", col = "lightyellow", breaks = 30)
# Plot distribution of R² values
hist(bootstrap_r2, main = "Bootstrap R²", xlab = "R²", col = "lightcoral", breaks = 30)
# Residuals plot for the original model
plot(fitted(model_5), resid(model_5), main = "Residuals vs Fitted",
xlab = "Fitted Values", ylab = "Residuals", col = "blue")
abline(h = 0, col = "red", lty = 2)
The bootstrapping results show stable estimates for the intercept, the EPS slope, and R^2, with narrow confidence intervals for both coefficients. The model appears well-specified, with no major patterns in the residuals. With this I will conclude testing for our model and proceed to the summary.
summary(model_5)
##
## Call:
## lm(formula = interaction_formula, data = model_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -174.16 -20.52 1.16 17.95 579.78
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.159e+01 1.225e+01 -6.660 8.81e-11 ***
## EPS 3.361e+01 2.372e+00 14.173 < 2e-16 ***
## P.E.Ratio 1.901e+00 1.854e-01 10.253 < 2e-16 ***
## P.B.Ratio 7.069e+00 1.163e+00 6.078 2.77e-09 ***
## Market.Cap.ClassificationMid Cap -1.294e+01 7.060e+00 -1.833 0.067482 .
## Market.Cap.ClassificationSmall Cap -6.208e+00 9.757e+00 -0.636 0.524986
## Growth.vs.ValueValue 2.598e+01 1.192e+01 2.179 0.029927 *
## I(EPS^2) -5.683e-01 1.576e-01 -3.605 0.000351 ***
## I(EPS^3) 6.707e-03 2.479e-03 2.706 0.007097 **
## I(P.E.Ratio^2) -5.419e-03 6.533e-04 -8.295 1.56e-15 ***
## I(P.E.Ratio^3) 3.705e-06 5.067e-07 7.313 1.38e-12 ***
## I(P.B.Ratio^2) -1.355e-01 2.956e-02 -4.585 6.02e-06 ***
## I(P.B.Ratio^3) 6.500e-04 1.677e-04 3.875 0.000124 ***
## EPS:Growth.vs.ValueValue -1.265e+01 1.116e+00 -11.337 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 60.94 on 412 degrees of freedom
## Multiple R-squared: 0.8221, Adjusted R-squared: 0.8165
## F-statistic: 146.4 on 13 and 412 DF, p-value: < 2.2e-16
The objective of this analysis was to develop a model that predicts stock prices based on various financial variables. The results show that EPS, P/E ratio, and P/B ratio are the most influential factors, with higher values of these variables generally corresponding to higher stock prices. Specifically, a one-unit increase in EPS is associated with a rise of about 33.61 in the stock’s recent close price, while higher P/E and P/B ratios also raise the predicted price, albeit at a diminishing rate due to the quadratic and cubic terms for these predictors. This suggests that extreme values of EPS and P/E lead to diminishing returns: higher profitability and market valuation are not linearly related to stock price increases.

The Growth vs. Value classification also plays a role. The positive coefficient on Growth.vs.ValueValue indicates that value stocks carry a higher baseline price in this sample, while the negative EPS:Growth.vs.ValueValue interaction shows that the positive effect of EPS on stock price is weaker for value stocks; EPS remains a key driver of stock price, but its impact is moderated by this classification. It was interesting to note that market cap was not as important as I would have assumed: a small-cap stock does not automatically indicate a stock price below ~$40 or some other arbitrary cutoff.

The model explains 82% of the variance in stock prices, with the residuals indicating a good fit. I do want to note, however, that this model is probably overfitted, and further cross-validation should be conducted before anyone uses it in the field to inform stock purchases. We also note from the earlier Cook’s distance plots that a certain number of outliers are present, but we decided to keep them in our final model.
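The cross-validation suggested above could be carried out along the following lines. This is a sketch only, assuming the `interaction_formula`, `stock_data`, and `Recent.Close.Price` objects defined earlier; rare factor levels appearing only in a test fold would need extra handling.

```r
set.seed(123)
k <- 10
# Assign each row of the data to one of k folds at random
folds <- sample(rep(1:k, length.out = nrow(stock_data)))
cv_rmse <- numeric(k)

for (f in 1:k) {
  train <- stock_data[folds != f, ]
  test  <- stock_data[folds == f, ]
  fit   <- lm(interaction_formula, data = train)
  pred  <- predict(fit, newdata = test)
  cv_rmse[f] <- sqrt(mean((test$Recent.Close.Price - pred)^2))
}

mean(cv_rmse)  # average out-of-sample RMSE across folds
```

Comparing this out-of-sample RMSE to the in-sample residual standard error (60.94) would indicate how much of the 82% R^2 survives on unseen data.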
It is also interesting to note that the quadratic terms have negative coefficients while the cubic terms are positive. Together these signs imply an S-shaped fitted relationship: the marginal effect of each variable first diminishes as the variable grows and then turns back upward at extreme values, which is worth keeping in mind when interpreting predictions far from the bulk of the data.